
Multiple Regression

The SRM we introduced earlier is inadequate for most applications in economics, because economic theory typically suggests that more than one explanatory variable determines the dependent variable. Thus, one dependent variable (Y) is explained by changes in several variables (X1, X2, ..., Xk). Such a relationship is described by what is known as multiple linear regression, which is written as:
$$Y_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n \qquad (1)$$
Note that $X_{i1}$ is set equal to one for every observation. This allows the equation to have an intercept. Thus, we have k - 1 slope parameters, as long as we have the explanatory variables in levels. The equation
above holds for each observation. Suppose we have n observations on (Y, X1, X2, ..., Xk). Then
letting
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix},\qquad X = \begin{bmatrix} 1 & X_{12} & \cdots & X_{1k} \\ 1 & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n2} & \cdots & X_{nk} \end{bmatrix},\qquad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \quad\text{and}\quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
We can write the n observations together in vector form as follows:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_1 + \beta_2 X_{12} + \cdots + \beta_k X_{1k} + \epsilon_1 \\ \beta_1 + \beta_2 X_{22} + \cdots + \beta_k X_{2k} + \epsilon_2 \\ \vdots \\ \beta_1 + \beta_2 X_{n2} + \cdots + \beta_k X_{nk} + \epsilon_n \end{bmatrix} = \begin{bmatrix} 1 & X_{12} & \cdots & X_{1k} \\ 1 & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n2} & \cdots & X_{nk} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
This can be expressed in matrix notation in the following manner:
$$Y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + \epsilon_{n\times 1} \qquad (2)$$
Our objective is to obtain estimators for the vector β.
Assumptions
In order to address the generalisation we introduced in the multiple regression model, the
classical assumptions we introduced for the simple linear regression model are modified as
follows.
1. The model is linear:
$$Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
2. $E[\epsilon] = 0 \;\Rightarrow\; E[Y] = X\beta$
3. $E[\epsilon\epsilon'] = \sigma^2 I$
Note that (3) implies that
a) the variance of each disturbance is constant (homoskedasticity), and
b) the covariance between any two different disturbances is zero (no serial correlation).
This follows from the fact that
$$E[\epsilon\epsilon'] = E\left[\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}\begin{bmatrix}\epsilon_1 & \epsilon_2 & \cdots & \epsilon_n\end{bmatrix}\right] = E\begin{bmatrix}\epsilon_1^2 & \epsilon_1\epsilon_2 & \cdots & \epsilon_1\epsilon_n\\ \epsilon_2\epsilon_1 & \epsilon_2^2 & \cdots & \epsilon_2\epsilon_n\\ \vdots & \vdots & \ddots & \vdots\\ \epsilon_n\epsilon_1 & \epsilon_n\epsilon_2 & \cdots & \epsilon_n^2\end{bmatrix}$$
$$= \begin{bmatrix}E[\epsilon_1^2] & E[\epsilon_1\epsilon_2] & \cdots & E[\epsilon_1\epsilon_n]\\ E[\epsilon_2\epsilon_1] & E[\epsilon_2^2] & \cdots & E[\epsilon_2\epsilon_n]\\ \vdots & \vdots & \ddots & \vdots\\ E[\epsilon_n\epsilon_1] & E[\epsilon_n\epsilon_2] & \cdots & E[\epsilon_n^2]\end{bmatrix} = \begin{bmatrix}\sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma^2\end{bmatrix} = \sigma^2 I$$
4. R(X) = k. Here we are saying that the rank of the matrix X is k, which is necessary for obtaining a solution to the problem.
5. X is a non-stochastic matrix.
6. $\epsilon \sim N(0, \sigma^2 I)$
The three methods of estimation mentioned for the simple regression model are applicable in this case as well. However, we will discuss only the method of least squares. This does not mean that you should not know about the other methods. Try to do that!!!
The method of Least Squares
Let $\hat\beta$ be any vector of estimators of the elements in the vector β and e be the vector of residuals. Then
$$Y = X\hat\beta + e$$
$$e = Y - X\hat\beta \qquad (3)$$
The method of least squares implies a choice of a vector $\hat\beta$ that minimises the sum of squares of the residuals, e'e. Now observe that
$$e'e = \begin{bmatrix} e_1 & e_2 & \cdots & e_n\end{bmatrix}\begin{bmatrix}e_1\\ e_2\\ \vdots\\ e_n\end{bmatrix} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2.$$
Note also that $e'e$ is a function of $\hat\beta$, since $e = Y - X\hat\beta$.
Define
$$S(\hat\beta) = e'e = \sum_{i=1}^n e_i^2 \qquad (4)$$
The necessary condition for minimization, in vector form, is then given as
$$\frac{\partial S}{\partial\hat\beta} = \begin{bmatrix}\partial S/\partial\hat\beta_1\\ \partial S/\partial\hat\beta_2\\ \vdots\\ \partial S/\partial\hat\beta_k\end{bmatrix} = \begin{bmatrix}0\\ 0\\ \vdots\\ 0\end{bmatrix},$$
or, in summation form,
$$\frac{\partial S}{\partial\hat\beta_j} = \sum_{i=1}^n \frac{\partial\left(e_i^2\right)}{\partial\hat\beta_j} = 2\sum_{i=1}^n e_i\frac{\partial e_i}{\partial\hat\beta_j} = 0$$

Now, $e_i = Y_i - \hat\beta_1 X_{i1} - \hat\beta_2 X_{i2} - \cdots - \hat\beta_k X_{ik}$.
Thus,
$$\frac{\partial e_i}{\partial\hat\beta_1} = -1, \qquad \frac{\partial e_i}{\partial\hat\beta_2} = -X_{i2}, \qquad \ldots, \qquad \frac{\partial e_i}{\partial\hat\beta_k} = -X_{ik},$$
or, in matrix notation,
$$\frac{\partial e}{\partial\hat\beta} = -\begin{bmatrix}1 & 1 & \cdots & 1\\ X_{12} & X_{22} & \cdots & X_{n2}\\ \vdots & \vdots & \ddots & \vdots\\ X_{1k} & X_{2k} & \cdots & X_{nk}\end{bmatrix} = -X'$$
X' can be written in vector notation as
$$X' = \begin{bmatrix}X_1 & X_2 & \cdots & X_n\end{bmatrix}$$
Given this notation, the model
$$Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
can be written in the following form, if we take each observation as a row:
$$Y_i = \begin{bmatrix}1 & X_{i2} & \cdots & X_{ik}\end{bmatrix}\beta + \epsilon_i \;\equiv\; X_i'\beta + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
where, in this representation,
$$X_i = \begin{bmatrix}1\\ X_{i2}\\ \vdots\\ X_{ik}\end{bmatrix} \quad\forall\, i = 1, 2, \ldots, n$$
Therefore, in summation notation,
$$\frac{\partial S}{\partial\hat\beta_j} = 2\sum_i e_i\left(-X_{ij}\right) \quad\text{for all } j \text{ at the minimizing level of estimates.}$$
Thus, at the minimizing level,
$$\sum_i e_iX_{ij} = 0 \quad\text{for all } j = 1, 2, \ldots, k.$$
This implies that
$$\frac{\partial S}{\partial\hat\beta_1} = 0 \;\rightarrow\; \sum_i e_iX_{i1} = \sum_i e_i = 0$$
$$\frac{\partial S}{\partial\hat\beta_2} = 0 \;\rightarrow\; \sum_i e_iX_{i2} = 0$$
$$\vdots$$
$$\frac{\partial S}{\partial\hat\beta_k} = 0 \;\rightarrow\; \sum_i e_iX_{ik} = 0$$
The first holds because $X_{i1} = 1$. Note also that, because of this, $\bar e = 0$, and that $\sum_i e_iX_{ik} = 0$ implies that the covariance between $e_i$ and $X_{ik}$ is zero. The latter, in turn, implies that the least squares residuals and the explanatory variables are uncorrelated. These are the normal equations of the problem, and the solution to this problem gives us the LS estimates for β.
The solution to this problem is even simpler if we proceed in matrix form. We proceed to do that as follows. Given $\sum_i e_iX_{ij} = 0$ for all j, it follows that
$$\begin{bmatrix}X_1 & X_2 & \cdots & X_n\end{bmatrix}\begin{bmatrix}e_1\\ e_2\\ \vdots\\ e_n\end{bmatrix} = \begin{bmatrix}0\\ 0\\ \vdots\\ 0\end{bmatrix}.$$
This is then written in matrix notation as
$$X'e = 0$$
Now
$$e = Y - X\hat\beta$$
Therefore
$$X'e = X'\left(Y - X\hat\beta\right) = 0$$
$$X'Y - X'X\hat\beta = 0$$
$$X'X\hat\beta = X'Y$$
$$(X'X)^{-1}(X'X)\hat\beta = (X'X)^{-1}X'Y$$
$$\hat\beta = (X'X)^{-1}X'Y$$
Thus, given the model
$$Y = X\beta + \epsilon$$
the LS normal equations are
$$X'X\hat\beta = X'Y$$
and the LS estimators are given by
$$\hat\beta = (X'X)^{-1}X'Y$$
Note: the dimensions of each matrix are as follows:
X is (n x k) => X' is (k x n), therefore X'X is (k x k);
X' is (k x n) and Y is (n x 1), therefore X'Y is (k x 1), which is also the dimension of $\hat\beta$.
The assumption that R(X) = k is needed because, if it holds, the rank of X'X will also be k. This, in turn, is needed if we are to have a solution to our problem, since if this condition does not hold the inverse of X'X will not exist. As a result we shall not have any meaningful solution to the problem. The rank condition implies that the k columns of X are linearly independent: no column in this matrix is an exact linear combination of the others, i.e., no explanatory variable is an exact linear function of the other explanatory variables. In econometrics we say that there is no exact multicollinearity in the model.
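As a minimal numerical sketch of these formulas (the data below are arbitrary illustrative numbers, not from the text), the following numpy code builds X with a column of ones, checks the rank condition, and computes $\hat\beta = (X'X)^{-1}X'Y$ and the normal equations $X'e = 0$:

```python
import numpy as np

# Arbitrary illustrative data (assumed for this sketch): n = 6 observations, k = 3 columns
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0, 5.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0, 10.0])
Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0, 31.0])

X = np.column_stack([np.ones_like(X2), X2, X3])   # first column of ones gives the intercept

# Rank condition R(X) = k: X'X is invertible only if the columns are linearly independent
assert np.linalg.matrix_rank(X) == X.shape[1], "exact multicollinearity: X'X is singular"

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)     # beta_hat = (X'X)^{-1} X'Y
residuals = Y - X @ beta_hat
print(beta_hat)
print(X.T @ residuals)                            # normal equations: X'e = 0 (up to rounding)
```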
Note that the matrix X'X is a symmetric square matrix equal to the following:
$$X'X = \begin{bmatrix} n & \sum X_{i2} & \cdots & \sum X_{ik}\\ \sum X_{i2} & \sum X_{i2}^2 & \cdots & \sum X_{i2}X_{ik}\\ \vdots & \vdots & \ddots & \vdots\\ \sum X_{ik} & \sum X_{ik}X_{i2} & \cdots & \sum X_{ik}^2\end{bmatrix}$$
and X'Y is
$$X'Y = \begin{bmatrix}\sum Y_i\\ \sum X_{i2}Y_i\\ \vdots\\ \sum X_{ik}Y_i\end{bmatrix}$$
We can derive the specified estimators if we have specified number of explanatory variables. We
shall do this for the case of two explanatory variables. You are advised to do likewise for the
case of three explanatory variables as your assignment. You are also asked to show that the
results obtained for the simple linear regression model can be obtained using the matrix
formulation. When we have only two variables the model reduces to
$$Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
Notice that k = 3 in this case.
To transform this into matrix notation we see that
$$Y = \begin{bmatrix}Y_1\\ Y_2\\ \vdots\\ Y_n\end{bmatrix},\quad X = \begin{bmatrix}1 & X_{12} & X_{13}\\ 1 & X_{22} & X_{23}\\ \vdots & \vdots & \vdots\\ 1 & X_{n2} & X_{n3}\end{bmatrix},\quad \beta = \begin{bmatrix}\beta_1\\ \beta_2\\ \beta_3\end{bmatrix} \quad\text{and}\quad \epsilon = \begin{bmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_n\end{bmatrix}$$
$$X'X = \begin{bmatrix}1 & 1 & \cdots & 1\\ X_{12} & X_{22} & \cdots & X_{n2}\\ X_{13} & X_{23} & \cdots & X_{n3}\end{bmatrix}\begin{bmatrix}1 & X_{12} & X_{13}\\ 1 & X_{22} & X_{23}\\ \vdots & \vdots & \vdots\\ 1 & X_{n2} & X_{n3}\end{bmatrix} = \begin{bmatrix}n & \sum X_{i2} & \sum X_{i3}\\ \sum X_{i2} & \sum X_{i2}^2 & \sum X_{i2}X_{i3}\\ \sum X_{i3} & \sum X_{i2}X_{i3} & \sum X_{i3}^2\end{bmatrix}$$
and
$$X'Y = \begin{bmatrix}1 & 1 & \cdots & 1\\ X_{12} & X_{22} & \cdots & X_{n2}\\ X_{13} & X_{23} & \cdots & X_{n3}\end{bmatrix}\begin{bmatrix}Y_1\\ Y_2\\ \vdots\\ Y_n\end{bmatrix} = \begin{bmatrix}\sum Y_i\\ \sum X_{i2}Y_i\\ \sum X_{i3}Y_i\end{bmatrix}$$
Then the normal equations boil down to

$$\begin{bmatrix}n & \sum X_{i2} & \sum X_{i3}\\ \sum X_{i2} & \sum X_{i2}^2 & \sum X_{i2}X_{i3}\\ \sum X_{i3} & \sum X_{i2}X_{i3} & \sum X_{i3}^2\end{bmatrix}\begin{bmatrix}\hat\beta_1\\ \hat\beta_2\\ \hat\beta_3\end{bmatrix} = \begin{bmatrix}\sum Y_i\\ \sum X_{i2}Y_i\\ \sum X_{i3}Y_i\end{bmatrix}$$
whereby, multiplying out the whole equation, one gets
$$n\hat\beta_1 + \hat\beta_2\sum X_{i2} + \hat\beta_3\sum X_{i3} = \sum Y_i$$
$$\hat\beta_1\sum X_{i2} + \hat\beta_2\sum X_{i2}^2 + \hat\beta_3\sum X_{i2}X_{i3} = \sum X_{i2}Y_i$$
$$\hat\beta_1\sum X_{i3} + \hat\beta_2\sum X_{i2}X_{i3} + \hat\beta_3\sum X_{i3}^2 = \sum X_{i3}Y_i$$
Dividing the first normal equation by n (the number of observations), one gets
$$\bar Y = \hat\beta_1 + \hat\beta_2\bar X_2 + \hat\beta_3\bar X_3$$
$$\hat\beta_1 = \bar Y - \hat\beta_2\bar X_2 - \hat\beta_3\bar X_3$$
Substituting this result into the remaining two normal equations we get
$$\left(\bar Y - \hat\beta_2\bar X_2 - \hat\beta_3\bar X_3\right)\sum X_{i2} + \hat\beta_2\sum X_{i2}^2 + \hat\beta_3\sum X_{i2}X_{i3} = \sum X_{i2}Y_i$$
$$\left(\bar Y - \hat\beta_2\bar X_2 - \hat\beta_3\bar X_3\right)\sum X_{i3} + \hat\beta_2\sum X_{i2}X_{i3} + \hat\beta_3\sum X_{i3}^2 = \sum X_{i3}Y_i$$
which is equal to
$$n\bar X_2\bar Y - n\hat\beta_2\bar X_2^2 - n\hat\beta_3\bar X_2\bar X_3 + \hat\beta_2\sum X_{i2}^2 + \hat\beta_3\sum X_{i2}X_{i3} = \sum X_{i2}Y_i$$
$$n\bar X_3\bar Y - n\hat\beta_2\bar X_2\bar X_3 - n\hat\beta_3\bar X_3^2 + \hat\beta_2\sum X_{i2}X_{i3} + \hat\beta_3\sum X_{i3}^2 = \sum X_{i3}Y_i$$
Rearranging these equations one gets
$$\hat\beta_2\left(\sum X_{i2}^2 - n\bar X_2^2\right) + \hat\beta_3\left(\sum X_{i2}X_{i3} - n\bar X_2\bar X_3\right) = \sum X_{i2}Y_i - n\bar X_2\bar Y$$
$$\hat\beta_2\left(\sum X_{i2}X_{i3} - n\bar X_2\bar X_3\right) + \hat\beta_3\left(\sum X_{i3}^2 - n\bar X_3^2\right) = \sum X_{i3}Y_i - n\bar X_3\bar Y$$
which boil down to
$$\hat\beta_2\sum x_{i2}^2 + \hat\beta_3\sum x_{i2}x_{i3} = \sum x_{i2}y_i$$
$$\hat\beta_2\sum x_{i2}x_{i3} + \hat\beta_3\sum x_{i3}^2 = \sum x_{i3}y_i$$
where lower-case letters denote deviations from the means. Now, let
$$S_{lj} = \sum_i x_{il}x_{ij} \quad\forall\, l, j = 2, 3$$
$$S_{ly} = \sum_i x_{il}y_i \quad\forall\, l = 2, 3$$
The normal equations then boil down to
$$\hat\beta_2 S_{22} + \hat\beta_3 S_{23} = S_{2y}$$
$$\hat\beta_2 S_{23} + \hat\beta_3 S_{33} = S_{3y}$$
This is a system of simultaneous linear equations which can be solved in three different ways:
Remember!!!!
1. by substitution
2. by matrix inversion
3. by Cramer's rule
I will use the matrix inversion method. Homework: use the other methods and solve the equations.
We can rewrite the normal equations
$$\hat\beta_2 S_{22} + \hat\beta_3 S_{23} = S_{2y}$$
$$\hat\beta_2 S_{23} + \hat\beta_3 S_{33} = S_{3y}$$
in matrix form as
$$\begin{bmatrix}S_{22} & S_{23}\\ S_{23} & S_{33}\end{bmatrix}\begin{bmatrix}\hat\beta_2\\ \hat\beta_3\end{bmatrix} = \begin{bmatrix}S_{2y}\\ S_{3y}\end{bmatrix}$$
Thus
$$\begin{bmatrix}\hat\beta_2\\ \hat\beta_3\end{bmatrix} = \begin{bmatrix}S_{22} & S_{23}\\ S_{23} & S_{33}\end{bmatrix}^{-1}\begin{bmatrix}S_{2y}\\ S_{3y}\end{bmatrix}$$
Thus, we need to obtain the inverse of the matrix of the $S_{lj}$ and post-multiply it with the vector of the $S_{ly}$. We first obtain the inverse of $S = [S_{lj}]$, for which we should start by getting the determinant of S, as follows:
$$|S| = \begin{vmatrix}S_{22} & S_{23}\\ S_{23} & S_{33}\end{vmatrix} = S_{22}S_{33} - S_{23}^2$$
Replacing each element by its cofactor and transposing the matrix we get
$$\operatorname{adj}(S) = \begin{bmatrix}S_{33} & -S_{23}\\ -S_{23} & S_{22}\end{bmatrix}$$
Multiplying this matrix by the inverse of its determinant we get
$$S^{-1} = \frac{1}{S_{22}S_{33} - S_{23}^2}\begin{bmatrix}S_{33} & -S_{23}\\ -S_{23} & S_{22}\end{bmatrix},$$
which is the inverse of the matrix S. Now post-multiplying this by the vector of the $S_{ly}$ we get
$$\begin{bmatrix}\hat\beta_2\\ \hat\beta_3\end{bmatrix} = \frac{1}{S_{22}S_{33} - S_{23}^2}\begin{bmatrix}S_{33} & -S_{23}\\ -S_{23} & S_{22}\end{bmatrix}\begin{bmatrix}S_{2y}\\ S_{3y}\end{bmatrix} = \frac{1}{S_{22}S_{33} - S_{23}^2}\begin{bmatrix}S_{33}S_{2y} - S_{23}S_{3y}\\ S_{22}S_{3y} - S_{23}S_{2y}\end{bmatrix}$$
Regression in deviation form


It is usually simpler to tackle estimation of the parameters in a multiple regression model via the
deviation of variables from their mean. Given the model:
$$Y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_k X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
the equation for the sample means is
$$\bar Y = \beta_1 + \beta_2\bar X_2 + \cdots + \beta_k\bar X_k + \bar\epsilon$$
Subtract the mean equation from the regression equation to get
$$Y_i - \bar Y = \beta_2\left(X_{i2} - \bar X_2\right) + \beta_3\left(X_{i3} - \bar X_3\right) + \cdots + \beta_k\left(X_{ik} - \bar X_k\right) + \epsilon_i - \bar\epsilon \quad\forall\, i = 1, 2, \ldots, n$$
$$y_i = \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_k x_{ik} + \tilde\epsilon_i \quad\forall\, i = 1, 2, \ldots, n$$
where small letters indicate the deviation from the mean for the respective variable and $\tilde\epsilon_i = \epsilon_i - \bar\epsilon$.
Then it follows that
$$\tilde\epsilon_i = y_i - \beta_2 x_{i2} - \beta_3 x_{i3} - \cdots - \beta_k x_{ik} \quad\forall\, i = 1, 2, \ldots, n$$
$$\sum_i\tilde\epsilon_i^2 = \sum_i\left(y_i - \beta_2 x_{i2} - \beta_3 x_{i3} - \cdots - \beta_k x_{ik}\right)^2$$
To derive the OLS estimators we replace the error terms with their corresponding residuals, the parameters with their corresponding estimates, and obtain the FOCs for minimizing this problem. That is, write
$$\sum_i\hat\epsilon_i^2 = \sum_i\left(y_i - \hat\beta_2 x_{i2} - \hat\beta_3 x_{i3} - \cdots - \hat\beta_k x_{ik}\right)^2$$
and take the partial derivatives of this with respect to $\hat\beta_j$, j = 2, ..., k, and equate each derivative to zero to obtain the FOCs. That is,
$$\frac{\partial\sum_i\hat\epsilon_i^2}{\partial\hat\beta_j} = 0 \;\rightarrow\; 2\sum_i\hat\epsilon_i\frac{\partial\hat\epsilon_i}{\partial\hat\beta_j} = 0 \quad\forall\, j = 2, \ldots, k$$
Now
$$\frac{\partial\hat\epsilon_i}{\partial\hat\beta_j} = -x_{ij}; \quad\forall\, j = 2, \ldots, k$$
Therefore, $2\sum_i\hat\epsilon_i\,\partial\hat\epsilon_i/\partial\hat\beta_j = 0$ implies that
$$2\sum_i\left(y_i - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}\right)(-x_{i2}) = 0$$
$$2\sum_i\left(y_i - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}\right)(-x_{i3}) = 0$$
$$\vdots$$
$$2\sum_i\left(y_i - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}\right)(-x_{ik}) = 0$$
This boils down to the following system of equations:
$$\sum_i x_{i2}y_i = \hat\beta_2\sum_i x_{i2}^2 + \hat\beta_3\sum_i x_{i2}x_{i3} + \cdots + \hat\beta_k\sum_i x_{i2}x_{ik}$$
$$\sum_i x_{i3}y_i = \hat\beta_2\sum_i x_{i2}x_{i3} + \hat\beta_3\sum_i x_{i3}^2 + \cdots + \hat\beta_k\sum_i x_{i3}x_{ik}$$
$$\vdots$$
$$\sum_i x_{ik}y_i = \hat\beta_2\sum_i x_{i2}x_{ik} + \hat\beta_3\sum_i x_{i3}x_{ik} + \cdots + \hat\beta_k\sum_i x_{ik}^2$$
This could then be written as follows in matrix notation:
$$x'y = (x'x)\hat\beta$$
where
$$y = \begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix},\quad x = \begin{bmatrix}x_{12} & \cdots & x_{1k}\\ \vdots & \ddots & \vdots\\ x_{n2} & \cdots & x_{nk}\end{bmatrix},\quad \hat\beta = \begin{bmatrix}\hat\beta_2\\ \vdots\\ \hat\beta_k\end{bmatrix}$$
It then follows that
$$x'x = \begin{bmatrix}\sum x_{i2}^2 & \sum x_{i2}x_{i3} & \cdots & \sum x_{i2}x_{ik}\\ \sum x_{i2}x_{i3} & \sum x_{i3}^2 & \cdots & \sum x_{i3}x_{ik}\\ \vdots & \vdots & \ddots & \vdots\\ \sum x_{i2}x_{ik} & \sum x_{i3}x_{ik} & \cdots & \sum x_{ik}^2\end{bmatrix}$$
and
$$x'y = \begin{bmatrix}\sum x_{i2}y_i\\ \sum x_{i3}y_i\\ \vdots\\ \sum x_{ik}y_i\end{bmatrix}$$
Therefore, since
$$x'y = (x'x)\hat\beta,$$
it follows that
$$\hat\beta = (x'x)^{-1}x'y$$
An illustrative example
Let Y = Annual salary (in thousands of Birr)
X2 = Years of education past high school
X3 = Years of experience with the firm.
We have the following data on 5 randomly selected individuals working in this specific firm.
We shall use the three different methods of obtaining the estimated parameters. For the first two
methods we need to obtain the sum, sum of squares and the sum of cross products for the
explanatory variables, and the sum of the dependent variable, as well as the sum of cross
products of the dependent and independent variables. This we obtain as follows:
Obs | Yi  | X2i | X3i | X2i² | X3i² | X2iX3i | YiX2i | YiX3i
1   | 30  | 4   | 10  | 16   | 100  | 40     | 120   | 300
2   | 20  | 3   | 8   | 9    | 64   | 24     | 60    | 160
3   | 36  | 6   | 11  | 36   | 121  | 66     | 216   | 396
4   | 24  | 4   | 9   | 16   | 81   | 36     | 96    | 216
5   | 40  | 8   | 12  | 64   | 144  | 96     | 320   | 480
Sum | 150 | 25  | 50  | 141  | 510  | 262    | 812   | 1552
Thus, the normal equations of this problem reduce to
$$5\hat\beta_1 + 25\hat\beta_2 + 50\hat\beta_3 = 150$$
$$25\hat\beta_1 + 141\hat\beta_2 + 262\hat\beta_3 = 812$$
$$50\hat\beta_1 + 262\hat\beta_2 + 510\hat\beta_3 = 1552$$
Solving this using matrix manipulation, the equations can be written in matrix notation as
$$\hat\beta = \begin{bmatrix}n & \sum X_{i2} & \sum X_{i3}\\ \sum X_{i2} & \sum X_{i2}^2 & \sum X_{i2}X_{i3}\\ \sum X_{i3} & \sum X_{i2}X_{i3} & \sum X_{i3}^2\end{bmatrix}^{-1}\begin{bmatrix}\sum Y_i\\ \sum X_{i2}Y_i\\ \sum X_{i3}Y_i\end{bmatrix} = \begin{bmatrix}5 & 25 & 50\\ 25 & 141 & 262\\ 50 & 262 & 510\end{bmatrix}^{-1}\begin{bmatrix}150\\ 812\\ 1552\end{bmatrix}$$
To obtain (X'X)⁻¹ we first solve for its determinant:
$$|X'X| = \begin{vmatrix}5 & 25 & 50\\ 25 & 141 & 262\\ 50 & 262 & 510\end{vmatrix}$$
$$= (5\times 141\times 510) + (25\times 262\times 50) + (50\times 25\times 262) - (50\times 141\times 50) - (262\times 262\times 5) - (510\times 25\times 25) = 80$$
Then we obtain the cofactor matrix of (X'X), which incidentally is also equal to its adjoint (why? because X'X is symmetric):
$$\operatorname{cofactor}(X'X) = \operatorname{adj}(X'X) = \begin{bmatrix}3266 & 350 & -500\\ 350 & 50 & -60\\ -500 & -60 & 80\end{bmatrix}$$
Thus,
$$\hat\beta = \frac{1}{80}\begin{bmatrix}3266 & 350 & -500\\ 350 & 50 & -60\\ -500 & -60 & 80\end{bmatrix}\begin{bmatrix}150\\ 812\\ 1552\end{bmatrix} = \frac{1}{80}\begin{bmatrix}-1900\\ -20\\ 440\end{bmatrix} = \begin{bmatrix}-23.75\\ -0.25\\ 5.5\end{bmatrix}$$
We can also solve the problem by using the deviation form of the regression, but we have to transform the variables by subtracting the mean from each, squaring them, taking the cross products of each observation, and summing them. While I leave it as your assignment to check that what we have done is correct, I will use the various formulae for deriving the sums of squared deviations and cross products as follows:
$$\sum x_{i2}^2 = \sum X_{i2}^2 - n\bar X_2^2 = 141 - 5(5)^2 = 141 - 125 = 16$$
$$\sum x_{i3}^2 = \sum X_{i3}^2 - n\bar X_3^2 = 510 - 5(10)^2 = 510 - 500 = 10$$
$$\sum x_{i2}x_{i3} = \sum X_{i2}X_{i3} - n\bar X_2\bar X_3 = 262 - 5(5)(10) = 262 - 250 = 12$$
$$\sum x_{i2}y_i = \sum X_{i2}Y_i - n\bar X_2\bar Y = 812 - 5(5)(30) = 812 - 750 = 62$$
$$\sum x_{i3}y_i = \sum X_{i3}Y_i - n\bar X_3\bar Y = 1552 - 5(10)(30) = 1552 - 1500 = 52$$

Therefore, given our earlier formula in deviation form, we have
$$\begin{bmatrix}\hat\beta_2\\ \hat\beta_3\end{bmatrix} = \begin{bmatrix}\sum x_{i2}^2 & \sum x_{i2}x_{i3}\\ \sum x_{i2}x_{i3} & \sum x_{i3}^2\end{bmatrix}^{-1}\begin{bmatrix}\sum x_{i2}y_i\\ \sum x_{i3}y_i\end{bmatrix} = \begin{bmatrix}16 & 12\\ 12 & 10\end{bmatrix}^{-1}\begin{bmatrix}62\\ 52\end{bmatrix}$$
We still need to obtain the cofactor and adjoint matrices for (x'x), but notice the reduction in matrix dimension. Now, the determinant of this matrix is
$$\begin{vmatrix}16 & 12\\ 12 & 10\end{vmatrix} = 160 - 144 = 16$$
and the cofactor matrix is
$$\begin{bmatrix}10 & -12\\ -12 & 16\end{bmatrix}$$
Thus,
$$(x'x)^{-1} = \frac{1}{16}\begin{bmatrix}10 & -12\\ -12 & 16\end{bmatrix}$$
Consequently, we obtain the estimates of our two slope parameters as
$$\begin{bmatrix}\hat\beta_2\\ \hat\beta_3\end{bmatrix} = \frac{1}{16}\begin{bmatrix}10 & -12\\ -12 & 16\end{bmatrix}\begin{bmatrix}62\\ 52\end{bmatrix} = \frac{1}{16}\begin{bmatrix}-4\\ 88\end{bmatrix} = \begin{bmatrix}-0.25\\ 5.5\end{bmatrix}$$
and, of course, we obtain $\hat\beta_1$ using the following formula:
$$\hat\beta_1 = \bar Y - \hat\beta_2\bar X_2 - \hat\beta_3\bar X_3 = 30 - (-0.25)(5) - (5.5)(10) = -23.75$$
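As a quick check on the hand calculations, a short numpy sketch using the five observations from the table above reproduces the same estimates, both from the full normal equations and from the deviation form:

```python
import numpy as np

# Data from the table: Y = annual salary, X2 = years of education, X3 = years of experience
Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0])
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0])

X = np.column_stack([np.ones(5), X2, X3])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # solves (X'X) beta = X'Y
print(beta_hat)                                 # approximately [-23.75, -0.25, 5.5]

# Deviation form: slopes only, then recover the intercept from the means
x = np.column_stack([X2 - X2.mean(), X3 - X3.mean()])
y = Y - Y.mean()
slopes = np.linalg.solve(x.T @ x, x.T @ y)      # approximately [-0.25, 5.5]
intercept = Y.mean() - slopes @ np.array([X2.mean(), X3.mean()])
print(slopes, intercept)                        # intercept approximately -23.75
```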

The problem of interpretation of coefficients in multiple regression

In the multiple regression model we have k - 1 explanatory variables. As a result we can talk of
two types of effects of explanatory variables on the dependent variable:
1. The joint effects of the explanatory variables on changes in Y, and
2. The partial effect of one of the explanatory variables on Y.
The partial effects can be interpreted in two ways.
a) Differential calculus: partial effects mean the effect of one variable holding all other variables
constant. This may not be correct, since in the collection of data, we do not keep all other
variables constant. In fact, all the variables are moving at the same time.
b) The partial effect of a variable is its effect on the dependent variable after eliminating the
effect of all the other explanatory variables.
The most important parameters in a regression are the slopes. A good way of interpreting them is to work with the econometric model in deviation form. We shall use a model with two explanatory variables in order to put the ideas in perspective. In what follows, we shall also drop the subscript i to reduce the cumbersome notation. Thus, let such a model be given as
$$y = \beta_{y2.3}\,x_2 + \beta_{y3.2}\,x_3 + \epsilon \qquad (1)$$
A simple linear regression of y on x2, on the other hand, is written as
$$y = \beta_{y2}\,x_2 + u \qquad (2)$$
Let $\hat\beta_{y2}$ be the estimator of $\beta_{y2}$ in equation 2; then
$$\hat\beta_{y2} = \frac{\sum x_2 y}{\sum x_2^2}$$
Let $\hat\beta_{y2.3}$ and $\hat\beta_{y3.2}$ be the estimators of $\beta_{y2.3}$ and $\beta_{y3.2}$, respectively, in equation 1. Now, recall that the normal equations for the regression model in equation 1 are given by
$$\sum x_2 y = \hat\beta_{y2.3}\sum x_2^2 + \hat\beta_{y3.2}\sum x_2 x_3 \qquad (3)$$
$$\sum x_3 y = \hat\beta_{y2.3}\sum x_2 x_3 + \hat\beta_{y3.2}\sum x_3^2 \qquad (4)$$
Divide equation 3 by $\sum x_2^2$ to get
$$\frac{\sum x_2 y}{\sum x_2^2} = \hat\beta_{y2.3} + \hat\beta_{y3.2}\frac{\sum x_2 x_3}{\sum x_2^2} \qquad (5)$$
Notice the following:
1. $\dfrac{\sum x_2 y}{\sum x_2^2}$ is nothing but $\hat\beta_{y2}$, and
2. $\dfrac{\sum x_2 x_3}{\sum x_2^2}$ is nothing but the regression coefficient of x3 (as dependent) on x2 (as independent); let this be $\hat\beta_{32}$. Thus, equation 5 can be rewritten as
$$\hat\beta_{y2} = \hat\beta_{y2.3} + \hat\beta_{y3.2}\hat\beta_{32} \qquad (5')$$
Now
$$\hat\beta_{32} = \frac{\sum x_2 x_3}{\sum x_2^2} = \frac{\widehat{\operatorname{cov}}(X_2, X_3)}{\widehat{\operatorname{var}}(X_2)}$$
Thus, $\hat\beta_{y2} = \hat\beta_{y2.3}$ if and only if cov(X2, X3) = 0, i.e., the two variables are uncorrelated. In other words, there must be no correlation at all between the two explanatory variables, a situation which is also called orthogonality of the explanatory variables.
For our previous example, we can calculate the correlation coefficient between X2 and X3 as follows:
$$r_{23} = \frac{\sum x_2 x_3}{\sqrt{\sum x_2^2\sum x_3^2}} = \frac{12}{\sqrt{16\times 10}} = 0.95$$
Therefore, we cannot use the simple regression model, because X2 and X3 are correlated.
By analogy,
$$\hat\beta_{y3} = \hat\beta_{y3.2} + \hat\beta_{y2.3}\hat\beta_{23}$$
Note that we can manipulate the equations as follows to get the interesting results:
$\hat\beta_{23}$ is obtained from the regression
$$x_2 = \beta_{23}\,x_3 + u$$
$\hat\beta_{32}$ is obtained from the regression
$$x_3 = \beta_{32}\,x_2 + v$$
Given our multiple regression model,
$$y = \beta_{y2.3}\,x_2 + \beta_{y3.2}\,x_3 + \epsilon,$$
the partial derivatives with respect to the explanatory variables are
$$\frac{\partial y}{\partial x_2} = \beta_{y2.3} \quad\text{and}\quad \frac{\partial y}{\partial x_3} = \beta_{y3.2}$$
Therefore, the coefficients in the model are the partial regression coefficients. If the model is specified in levels, these parameters turn out to be the marginal effects of the explanatory variables on the dependent variable. Note that this is not the case if the variables are not defined in levels.
Now, using our multiple regression model and regressing y on x2 and x3 we obtain
$$\hat\beta_{y2.3} = \frac{\sum x_3^2\sum x_2 y - \sum x_2 x_3\sum x_3 y}{\sum x_2^2\sum x_3^2 - \left(\sum x_2 x_3\right)^2}$$
Now, regress x2 on x3, i.e.,
$$x_2 = \beta_{23}\,x_3 + v;$$
then
$$\hat\beta_{23} = \frac{\sum x_2 x_3}{\sum x_3^2}$$
Now, consider the residuals of this model, $e_{2.3}$:
$$e_{2.3} = x_2 - \hat\beta_{23}\,x_3$$

Proposition: $\hat\beta_{y2.3}$ in the equation $y = \beta_{y2.3}x_2 + \beta_{y3.2}x_3 + \epsilon$ is equal to the coefficient of the regression of y on $e_{2.3} = x_2 - \hat\beta_{23}x_3$.
Proof: regress y on $e_{2.3}$, i.e.,
$$y = \beta_{(y2.3)}\,e_{2.3} + w$$
$$\hat\beta_{(y2.3)} = \frac{\sum e_{2.3}\,y}{\sum e_{2.3}^2}$$
For the numerator, $\sum e_{2.3}\,y = \sum\left(x_2 - \hat\beta_{23}x_3\right)y = \sum x_2 y - \frac{\sum x_2 x_3}{\sum x_3^2}\sum x_3 y$, and for the denominator, using $\sum e_{2.3}x_3 = 0$, $\sum e_{2.3}^2 = \sum e_{2.3}x_2 = \sum x_2^2 - \frac{\left(\sum x_2 x_3\right)^2}{\sum x_3^2}$. Hence
$$\hat\beta_{(y2.3)} = \frac{\sum x_2 y - \dfrac{\sum x_2 x_3}{\sum x_3^2}\sum x_3 y}{\sum x_2^2 - \dfrac{\left(\sum x_2 x_3\right)^2}{\sum x_3^2}} = \frac{\sum x_3^2\sum x_2 y - \sum x_2 x_3\sum x_3 y}{\sum x_2^2\sum x_3^2 - \left(\sum x_2 x_3\right)^2} = \hat\beta_{y2.3}$$
Therefore $\hat\beta_{y2.3}$ in the equation $y = \beta_{y2.3}x_2 + \beta_{y3.2}x_3 + \epsilon$ is identical to the coefficient of the regression of y on $e_{2.3} = x_2 - \hat\beta_{23}x_3$.
The auxiliary regression eliminates the influence of x3 on x2. We are just regressing y on the residual (on what is left, so to speak).
Proposition: $\hat\beta_{y2.3}$ in the equation $y = \beta_{y2.3}x_2 + \beta_{y3.2}x_3 + \epsilon$ is equal to the coefficient of the regression of $e_{y.3} = y - \hat\beta_{y3}x_3$ on $e_{2.3} = x_2 - \hat\beta_{23}x_3$.
Proof:
Let $\hat\beta_{(y.3)(2.3)}$ be the coefficient of the regression of the residuals defined above. Then
$$\hat\beta_{(y.3)(2.3)} = \frac{\sum e_{y.3}\,e_{2.3}}{\sum e_{2.3}^2}$$
Take the numerator:
$$\sum\left(y - \hat\beta_{y3}x_3\right)\left(x_2 - \hat\beta_{23}x_3\right) = \sum\left(y - \hat\beta_{y3}x_3\right)x_2 - \hat\beta_{23}\sum\left(y - \hat\beta_{y3}x_3\right)x_3$$
$$= \sum\left(y - \hat\beta_{y3}x_3\right)x_2 - \hat\beta_{23}\sum e_{y.3}x_3 \quad(= 0)$$
$$= \sum x_2 y - \hat\beta_{y3}\sum x_2 x_3$$
Now, take the denominator:
$$\sum\left(x_2 - \hat\beta_{23}x_3\right)^2 = \sum\left(x_2 - \hat\beta_{23}x_3\right)x_2 - \hat\beta_{23}\sum\left(x_2 - \hat\beta_{23}x_3\right)x_3$$
$$= \sum\left(x_2 - \hat\beta_{23}x_3\right)x_2 - \hat\beta_{23}\sum e_{2.3}x_3 \quad(= 0)$$
$$= \sum x_2^2 - \hat\beta_{23}\sum x_2 x_3$$
Therefore
$$\hat\beta_{(y.3)(2.3)} = \frac{\sum x_2 y - \hat\beta_{y3}\sum x_2 x_3}{\sum x_2^2 - \hat\beta_{23}\sum x_2 x_3} = \frac{\sum x_2 y - \dfrac{\sum x_3 y}{\sum x_3^2}\sum x_2 x_3}{\sum x_2^2 - \dfrac{\sum x_2 x_3}{\sum x_3^2}\sum x_2 x_3} = \frac{\sum x_3^2\sum x_2 y - \sum x_2 x_3\sum x_3 y}{\sum x_2^2\sum x_3^2 - \left(\sum x_2 x_3\right)^2} = \hat\beta_{y2.3}$$
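As a numerical illustration of these two propositions (a minimal sketch reusing the salary data from the earlier example), the regression of y on the residual $e_{2.3}$, and of $e_{y.3}$ on $e_{2.3}$, both reproduce the multiple-regression slope on education, -0.25:

```python
import numpy as np

# Salary example data, converted to deviation form
Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0])
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0])
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

b23 = (x2 @ x3) / (x3 @ x3)        # regress x2 on x3
e23 = x2 - b23 * x3                # residual: x2 purged of x3
by3 = (x3 @ y) / (x3 @ x3)         # regress y on x3
ey3 = y - by3 * x3                 # residual: y purged of x3

print((e23 @ y) / (e23 @ e23))     # coefficient of y on e23   -> -0.25
print((ey3 @ e23) / (e23 @ e23))   # coefficient of ey3 on e23 -> -0.25 (equals beta_y2.3)
```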

Partial correlation
If $\hat Y = f(X_2, X_3)$ we can now think of two variants of correlation concepts. The first couple are those with which we are already familiar, namely:
1. $r_{y2}^2$ - the measure of the proportion of variation in Y that X2 alone explains;
2. $r_{y3}^2$ - the measure of the proportion of variation in Y that X3 alone explains;
but we can also think of
3. $R_{y.23}^2$ - the measure of the proportion of variation in Y that both X2 and X3 together explain.
Given the third concept, we would also like to know how much more we are explaining by adding one more explanatory variable. For instance, if we start with X2 and add X3, we would like to measure the increment to the correlation coefficient from including this variable. This is measured by what is known as the partial correlation coefficient and is written as $r_{y3.2}$. Now,
$$r_{y2} = \frac{\sum x_2 y}{\sqrt{\sum x_2^2\sum y^2}}$$
The partial correlation of X3, after X2 has been included, would then be
$$r_{y3.2} = \frac{\sum e_{y.2}\,e_{3.2}}{\sqrt{\sum e_{y.2}^2\,\sum e_{3.2}^2}} = \frac{\sum\left(y - \hat\beta_{y2}x_2\right)\left(x_3 - \hat\beta_{32}x_2\right)}{\sqrt{\sum\left(y - \hat\beta_{y2}x_2\right)^2\,\sum\left(x_3 - \hat\beta_{32}x_2\right)^2}}$$

Let us first take the numerator of this equation and expand it as follows:
$$\sum\left(y - \hat\beta_{y2}x_2\right)\left(x_3 - \hat\beta_{32}x_2\right) = \sum\left(y - \hat\beta_{y2}x_2\right)x_3 - \hat\beta_{32}\sum\left(y - \hat\beta_{y2}x_2\right)x_2$$
$$= \sum\left(y - \hat\beta_{y2}x_2\right)x_3 - \hat\beta_{32}\sum e_{y.2}x_2 \quad(= 0)$$
$$= \sum x_3 y - \hat\beta_{y2}\sum x_2 x_3 = \sum x_3 y - \frac{\sum x_2 y}{\sum x_2^2}\sum x_2 x_3$$
Second, let us take one of the elements in the denominator and expand it:
$$\sum\left(y - \hat\beta_{y2}x_2\right)^2 = \sum\left(y^2 + \hat\beta_{y2}^2x_2^2 - 2\hat\beta_{y2}x_2 y\right) = \sum y^2 + \hat\beta_{y2}^2\sum x_2^2 - 2\hat\beta_{y2}\sum x_2 y$$
$$= \sum y^2 + \frac{\left(\sum x_2 y\right)^2}{\left(\sum x_2^2\right)^2}\sum x_2^2 - 2\frac{\left(\sum x_2 y\right)^2}{\sum x_2^2} = \sum y^2 - \frac{\left(\sum x_2 y\right)^2}{\sum x_2^2}$$
$$= \sum y^2\left(1 - \frac{\left(\sum x_2 y\right)^2}{\sum x_2^2\sum y^2}\right) = \sum y^2\left(1 - r_{y2}^2\right)$$
Third, we can similarly get, for the second element of the denominator,
$$\sum\left(x_3 - \hat\beta_{32}x_2\right)^2 = \sum x_3^2\left(1 - r_{23}^2\right)$$
Therefore the denominator boils down to
$$\sqrt{\sum\left(y - \hat\beta_{y2}x_2\right)^2\sum\left(x_3 - \hat\beta_{32}x_2\right)^2} = \sqrt{\sum y^2\left(1 - r_{y2}^2\right)\sum x_3^2\left(1 - r_{23}^2\right)} = \sqrt{\sum y^2\sum x_3^2\left(1 - r_{y2}^2\right)\left(1 - r_{23}^2\right)}$$
Taking what we derived for the numerator and denominator together we get
$$r_{y3.2} = \frac{\sum x_3 y - \dfrac{\sum x_2 y}{\sum x_2^2}\sum x_2 x_3}{\sqrt{\sum y^2\sum x_3^2\left(1 - r_{y2}^2\right)\left(1 - r_{23}^2\right)}}$$
$$= \left(\frac{\sum x_3 y}{\sqrt{\sum y^2\sum x_3^2}} - \frac{\sum x_2 y}{\sqrt{\sum x_2^2\sum y^2}}\cdot\frac{\sum x_2 x_3}{\sqrt{\sum x_2^2\sum x_3^2}}\right)\times\frac{1}{\sqrt{\left(1 - r_{y2}^2\right)\left(1 - r_{23}^2\right)}}$$
$$= \frac{r_{y3} - r_{y2}\,r_{23}}{\sqrt{\left(1 - r_{y2}^2\right)\left(1 - r_{23}^2\right)}}$$
Note that ry2.3 need not have the same sign as ry2. Analogously, one can easily obtain
$$r_{y2.3} = \frac{r_{y2} - r_{y3}\,r_{23}}{\sqrt{\left(1 - r_{y3}^2\right)\left(1 - r_{23}^2\right)}}$$
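As a quick numerical check (a sketch reusing the salary data from the earlier example), the simple and partial correlation coefficients can be computed directly from the formulas just derived:

```python
import numpy as np

Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0])
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0])
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

def corr(a, b):
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

r_y2, r_y3, r_23 = corr(y, x2), corr(y, x3), corr(x2, x3)
r_y2_3 = (r_y2 - r_y3 * r_23) / np.sqrt((1 - r_y3**2) * (1 - r_23**2))
r_y3_2 = (r_y3 - r_y2 * r_23) / np.sqrt((1 - r_y2**2) * (1 - r_23**2))
print(r_y2, r_y3, r_23)      # simple correlations (r_23 is about 0.95, as computed earlier)
print(r_y2_3, r_y3_2)        # partial correlations
```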

The relationships between simple, partial and multiple correlation coefficients: (Interpretation of
simple and partial correlation coefficients)
In the case of the SLR, the simple r has a straightforward meaning: it measures the degree of (linear) association (mind you, not causation) between the dependent variable (Y) and the single independent variable (X). But once we go beyond the SLR case, we need to be careful regarding the interpretation of the simple correlation coefficient. From our result earlier, we had
$$r_{y2.3} = \frac{r_{y2} - r_{y3}\,r_{23}}{\sqrt{\left(1 - r_{y3}^2\right)\left(1 - r_{23}^2\right)}}$$
Thus, we observe the following:


1. Even if ry2 = 0, ry2.3 will not be zero unless ry3 or r23 or both are equal to zero.
2. If ry2 = 0 and ry3 and r23 are non zero and are of the same sign, ry2.3 will be negative, whereas
if they are of opposite sign it will be positive

When ry2 = 0:

sign of ry3 | sign of r23 | sign of ry2.3
+           | +           | -
-           | +           | +
+           | -           | +

For example, let Y = Crop yield X2 = Rainfall X3 = Temperature

Assume ry2 = 0, i.e., no association between crop yield and rainfall. Assume further that ry3 is positive, so there is a positive association between crop yield and temperature, and that r23 is negative, i.e., there is a negative association between rainfall and temperature. Then, given our formula for partial correlation, ry2.3 will be positive; that is, after accounting for the effect of temperature, there is a positive association between yield and rainfall. Though the result may seem paradoxical, it is not surprising: since temperature affects both yield and rainfall, in order to find out the net relationship between crop yield and rainfall we need to remove the influence of the "nuisance" variable (temperature). This example shows how one could be misled by the simple correlation coefficient.
3. The terms ry2 and ry2.3 need not have the same sign.
4. We saw that r² lies between 0 and 1. The same property holds for the square of the partial correlation coefficient, i.e., $0 \le r_{y2.3}^2 \le 1$. If this is true, then, on the one hand,
$$r_{y2.3}^2 \le 1$$
$$\rightarrow \left(r_{y2} - r_{y3}r_{23}\right)^2 \le \left(1 - r_{y3}^2\right)\left(1 - r_{23}^2\right)$$
$$\rightarrow r_{y2}^2 + r_{y3}^2r_{23}^2 - 2r_{y2}r_{y3}r_{23} \le 1 - r_{y3}^2 - r_{23}^2 + r_{y3}^2r_{23}^2$$
$$\rightarrow r_{y2}^2 + r_{y3}^2 + r_{23}^2 - 2r_{y2}r_{y3}r_{23} \le 1$$
On the other hand, we would also have
$$0 \le r_{y2.3}^2 \;\rightarrow\; 0 \le \left(r_{y2} - r_{y3}r_{23}\right)^2 = r_{y2}^2 + r_{y3}^2r_{23}^2 - 2r_{y2}r_{y3}r_{23},$$
so that $r_{y2}^2 + r_{y3}^2r_{23}^2 \ge 2r_{y2}r_{y3}r_{23}$, and hence
$$r_{y2}^2 + r_{y3}^2 + r_{23}^2 - 2r_{y2}r_{y3}r_{23} \ge r_{y3}^2 + r_{23}^2 - r_{y3}^2r_{23}^2 = r_{y3}^2\left(1 - r_{23}^2\right) + r_{23}^2 \ge 0.$$
Comparing the two inequalities, we obtain
$$0 \le r_{y2}^2 + r_{y3}^2 + r_{23}^2 - 2r_{y2}r_{y3}r_{23} \le 1.$$

5. Suppose ry3 = r23 = 0; this does not necessarily mean that ry2 is also zero. From our last result, $r_{y2}^2$ is then simply some value between 0 and 1. Thus, the fact that Y and X3, and X2 and X3, are uncorrelated does not mean that Y and X2 are uncorrelated.
6. The expression $r_{y2.3}^2$ may be interpreted as the proportion of the variation in Y not explained by the variable X3 that has been explained by the inclusion of X2 into the model, and thus could be called the coefficient of partial determination.

This leads us to the discussion of R².

Recall that in the simple regression equation we had
$$r^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \;\rightarrow\; RSS = TSS\left(1 - r^2\right)$$
Thus,
$$RSS = \sum y_i^2\left(1 - r^2\right)$$
Thus, for simple (one explanatory variable) regression we get
$$RSS = \sum y_i^2\left(1 - r_{y2}^2\right)$$
This can be shown as follows:
$$RSS = \sum\hat\epsilon_i^2 = \sum\left(y_i - \hat\beta_2 x_{i2}\right)^2 = \sum\left(y_i^2 + \hat\beta_2^2x_{i2}^2 - 2\hat\beta_2 x_{i2}y_i\right) = \sum y_i^2 + \hat\beta_2^2\sum x_{i2}^2 - 2\hat\beta_2\sum x_{i2}y_i$$
$$= \sum y_i^2 + \frac{\left(\sum x_{i2}y_i\right)^2}{\left(\sum x_{i2}^2\right)^2}\sum x_{i2}^2 - 2\frac{\left(\sum x_{i2}y_i\right)^2}{\sum x_{i2}^2} = \sum y_i^2 - \frac{\left(\sum x_{i2}y_i\right)^2}{\sum x_{i2}^2} = \sum y_i^2\left(1 - \frac{\left(\sum x_{i2}y_i\right)^2}{\sum x_{i2}^2\sum y_i^2}\right) = \sum y_i^2\left(1 - r_{y2}^2\right)$$
With two explanatory variables we have
$$RSS = \sum\hat\epsilon_i^2 = \sum\left(y_i - \hat\beta_2x_{i2} - \hat\beta_3x_{i3}\right)^2 = \sum\left(y_i - \hat y_i\right)^2 = \sum y_i^2 + \sum\hat y_i^2 - 2\sum y_i\hat y_i$$
$$= \sum y_i^2 - \sum\hat y_i^2 \qquad\left(\text{since }\sum y_i\hat y_i = \sum\hat y_i^2\right)$$
$$= \sum y_i^2\left(1 - \frac{\sum\hat y_i^2}{\sum y_i^2}\right) = \sum y_i^2\left(1 - R_{y.23}^2\right)$$
For the case of k explanatory variables this is generalized as
$$RSS = \sum\hat\epsilon_i^2 = \sum y_i^2\left(1 - R_{y.23\ldots k}^2\right)$$

Given a model with two explanatory variables:
$\sum y_i^2\left(1 - R_{y.23}^2\right)$ is the residual sum of squares after both X2 and X3 are included in the model;
$\sum y_i^2\left(1 - r_{y2}^2\right)$ is nothing but the residual sum of squares after the inclusion of X2 only into the model; and
$r_{y3.2}^2$ measures the proportion of this residual sum of squares explained by X3. Therefore the unexplained (residual) sum of squares after X3 is also included is given by
$$\left(1 - r_{y3.2}^2\right)\sum y_i^2\left(1 - r_{y2}^2\right),$$
which is also equal to
$$\sum y_i^2\left(1 - R_{y.23}^2\right).$$
Hence we get the result
$$\sum y_i^2\left(1 - R_{y.23}^2\right) = \left(1 - r_{y3.2}^2\right)\sum y_i^2\left(1 - r_{y2}^2\right)$$
$$1 - R_{y.23}^2 = \left(1 - r_{y2}^2\right)\left(1 - r_{y3.2}^2\right)$$
An interpretation of this result can be made by expanding it as follows:
$$1 - R_{y.23}^2 = \left(1 - r_{y2}^2\right)\left(1 - r_{y3.2}^2\right) = 1 - r_{y2}^2 - \left(1 - r_{y2}^2\right)r_{y3.2}^2$$
$$R_{y.23}^2 - r_{y2}^2 = \left(1 - r_{y2}^2\right)r_{y3.2}^2$$
Thus
$$R_{y.23}^2 = r_{y2}^2 + \left(1 - r_{y2}^2\right)r_{y3.2}^2$$

Therefore R² will not decrease if an additional variable is introduced into the model. The proportion of variation in Y explained by both X2 and X3 is given by the proportion of variation in Y explained by X2, plus the proportion not explained by X2 times the proportion of that remainder explained by X3 after accounting for the influence of X2.
Thus
1. $R_{y.23}^2$ ≡ proportion of variation in Y explained jointly by X2 and X3;
2. $r_{y2}^2$ ≡ proportion of variation in Y explained by X2 alone; therefore
3. $R_{y.23}^2 - r_{y2}^2$ ≡ incremental contribution of X3 to the explanation of the variation in Y, having already included X2 in the model;
4. $1 - r_{y2}^2$ ≡ proportion of variation in Y that remains to be explained after having included only X2 in the model.
We can think of each observation as being made up of an explained part and an unexplained part; we then define
$$Y_i = \hat Y_i + \hat\epsilon_i$$
$\sum\left(Y_i - \bar Y\right)^2$ is the total sum of squares (TSS);
$\sum\left(\hat Y_i - \bar Y\right)^2$ is the explained sum of squares (ESS);
$\sum\hat\epsilon_i^2$ is the residual sum of squares (RSS).
Then TSS = ESS + RSS.

 How do we think about how well our sample regression line fits our sample data?

 We can compute the fraction of the total sum of squares (TSS) that is explained by the model; call this the R-squared of the regression:

 R² = ESS/TSS = 1 - RSS/TSS

We can also think of R² as being equal to the squared correlation coefficient between the observed Yi and the fitted values $\hat Y_i$; that is,
$$R^2 = \frac{\left(\sum\left(Y_i - \bar Y\right)\left(\hat Y_i - \bar{\hat Y}\right)\right)^2}{\sum\left(Y_i - \bar Y\right)^2\,\sum\left(\hat Y_i - \bar{\hat Y}\right)^2}$$

 R² can never decrease when another independent variable is added to a regression, and usually will increase.

 Because R² will usually increase with the number of independent variables, it is not a good way to compare models.

Given our model with two explanatory variables,
$$y_i = \hat\beta_{y2.3}\,x_{i2} + \hat\beta_{y3.2}\,x_{i3} + \hat\epsilon_i,$$
we propose the following:
$$\hat\beta_{y2.3} = \frac{r_{y2} - r_{y3}\,r_{23}}{1 - r_{23}^2}\sqrt{\frac{\sum y_i^2}{\sum x_{i2}^2}}$$
To prove this, given
$$\hat\beta_{y2.3} = \frac{\sum x_{i3}^2\sum x_{i2}y_i - \sum x_{i2}x_{i3}\sum x_{i3}y_i}{\sum x_{i2}^2\sum x_{i3}^2 - \left(\sum x_{i2}x_{i3}\right)^2},$$
we expand the numerator and denominator as follows:
a) Using $\sum x_{i2}y_i = r_{y2}\sqrt{\sum x_{i2}^2\sum y_i^2}$, $\sum x_{i2}x_{i3} = r_{23}\sqrt{\sum x_{i2}^2\sum x_{i3}^2}$ and $\sum x_{i3}y_i = r_{y3}\sqrt{\sum x_{i3}^2\sum y_i^2}$,
$$\sum x_{i3}^2\sum x_{i2}y_i - \sum x_{i2}x_{i3}\sum x_{i3}y_i = \sum x_{i3}^2\sqrt{\sum x_{i2}^2\sum y_i^2}\,\left(r_{y2} - r_{23}r_{y3}\right)$$
b) $$\sum x_{i2}^2\sum x_{i3}^2 - \left(\sum x_{i2}x_{i3}\right)^2 = \sum x_{i2}^2\sum x_{i3}^2\left(1 - \frac{\left(\sum x_{i2}x_{i3}\right)^2}{\sum x_{i2}^2\sum x_{i3}^2}\right) = \sum x_{i2}^2\sum x_{i3}^2\left(1 - r_{23}^2\right)$$
Therefore
$$\hat\beta_{y2.3} = \frac{\sum x_{i3}^2\sqrt{\sum x_{i2}^2\sum y_i^2}\,\left(r_{y2} - r_{23}r_{y3}\right)}{\sum x_{i2}^2\sum x_{i3}^2\left(1 - r_{23}^2\right)} = \frac{r_{y2} - r_{y3}r_{23}}{1 - r_{23}^2}\times\sqrt{\frac{\sum y_i^2}{\sum x_{i2}^2}}$$

We can calculate the R² for our example in many ways; here we take
$$R^2 = \frac{\hat\beta_{y2.3}\sum x_{i2}y_i + \hat\beta_{y3.2}\sum x_{i3}y_i}{\sum y_i^2} = \frac{(-0.25)(62) + (5.5)(52)}{272} = \frac{270.5}{272} = 0.994$$

with the following regression equation:
$$\hat Y = -23.75 - 0.25X_2 + 5.5X_3, \qquad R^2 = 0.994$$
It would be interesting to see what the simple regressions in this example give us. We get
$$\hat Y = 10.625 + 3.875X_2, \qquad R^2 = 0.883$$
$$\hat Y = -22 + 5.2X_3, \qquad R^2 = 0.994$$
The first simple regression equation predicts that an increase of one year of education results in an increase of Birr 3875 in annual salary. However, after allowing for the effect of years of experience, we find from the multiple regression equation that it does not result in any increase in salary. Thus omission of the variable "years of experience" gives us wrong conclusions about the effect of years of education on salaries.
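A short sketch (reusing the five salary observations) reproduces both the multiple regression and the two simple regressions, together with their R² values:

```python
import numpy as np

Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0])
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0])

def ols_r2(y, *regressors):
    """OLS with an intercept; returns the coefficient vector and R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return b, r2

print(ols_r2(Y, X2, X3))   # multiple regression: approx. (-23.75, -0.25, 5.5)
print(ols_r2(Y, X2))       # simple regression on education:  (10.625, 3.875), R^2 approx. 0.88
print(ols_r2(Y, X3))       # simple regression on experience: (-22, 5.2),      R^2 approx. 0.99
```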

Regression through the Origin

Sometimes we impose 0 = 0 (economic theory or common sense), OLS then minimizes the sum
of squared residuals with the intercept set at zero.
In this case the normal equations and other formulas will be the same as before, except that there
will be no "mean corrections."--the first equation in the normal equations will not exist.
In the simple linear regression
= +
for instance we have only one normal equation
∑ = ∑
~ to show that these are different from the OLS estimates (hat), leading to

= ∑

Assignment, find the normal equations and estimators for


= + +
Now given this, we can write the sample regression equation for the k variable case as
= + ⋯+
Here, X2 = 0, X3 = 0, …, Xk = 0 => the Y = 0.
The properties of OLS with intercept no longer hold for regression through the origin.
 The OLS residuals no longer have a zero sample average.

25
 R2 is defined as 1 - RSS/TSS, where =∑ = ∑( − ) and = ∑( − −
⋯− ) then R2 can actually be negative.
o To always have a nonnegative R-squared,
o R2 = squared correlation coefficient between the Y and
∑( )
=
∑( ) ∑

The average fitted value must be computed directly since it no longer equals Y
Problem with regression through the origin: if 0 in the population model is different from zero
=> the OLS estimators of the slope parameters will be biased.
Cost of estimating an intercept when 0 is truly zero is that the variances of the OLS slope
estimators are larger.

Properties of OLS estimators
We now state four assumptions, extending those of the simple regression model, under which the OLS estimators are unbiased for the population parameters, and then examine the bias in OLS when an important variable has been omitted from the regression.
Statistical properties have nothing to do with a particular sample: we look at the properties of the estimators when random sampling is done repeatedly.
ASSUMPTION 1 (LINEAR IN PARAMETERS)
The model in the population can be written as
$$Y = \beta_1 + \beta_2X_2 + \beta_3X_3 + \cdots + \beta_kX_k + \epsilon$$
ASSUMPTION 2 (RANDOM SAMPLING)
We have a random sample of n observations, {(Xi2, Xi3, ..., Xik, Yi): i = 1, 2, ..., n}, from the population model described by our regression equation. For each i,
$$Y_i = \beta_1 + \beta_2X_{i2} + \beta_3X_{i3} + \cdots + \beta_kX_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$
where εi contains the unobserved factors for observation i that affect Yi.
The OLS estimators $\hat\beta_j$, j = 1, ..., k, are considered to be estimators of the βj, j = 1, ..., k.
OLS chooses the estimates for a given sample so that
 the residuals $\hat\epsilon_i$ average out to zero, and
 the sample correlation between each X and the $\hat\epsilon_i$ is zero.
For OLS to be unbiased, we need the population version of this condition to be true.
ASSUMPTION 3 (ZERO CONDITIONAL MEAN)
E(ε|X2, X3, ..., Xk) = 0.
Assumption 3 can fail if:
 the functional relationship between Y and the Xs is misspecified. Example: in the model
cons = β1 + β2 inc + β3 inc² + ε,
o excluding inc² when we estimate the model;
o if the true model has log(wage) as Y but we use wage as Y in the regression;
 an important factor that is correlated with any of X2, X3, ..., Xk is omitted, due to
o data limitations or ignorance;
 there is a problem of measurement error in an X;
 one or more of the Xs is determined jointly with Y.
In each of these cases, ε may be correlated with an X.
When Assumption 3 holds we say we have exogenous explanatory variables.
If corr(Xj, ε) ≠ 0, then Xj is said to be endogenous.
The terms "exogenous" and "endogenous" originated in the analysis of simultaneous equations, but the term "endogenous X" has evolved to cover any case where an X is correlated with ε.
ASSUMPTION MLR 4 (NO PERFECT COLLINEARITY)
In the sample (and therefore in the population),
 none of the independent variables is constant, and
 there are no exact linear relationships among the independent variables.
This assumption concerns the Xs: it has nothing to do with the relationship between ε and the Xs.
It is more complicated than in simple regression, since we now look at the relationships among all the Xs. It allows the Xs to be correlated, just not perfectly correlated; if we did not allow for any correlation among the Xs, multiple regression would be of little use.
Perfect collinearity among the Xs exists when:
1. One X is a constant multiple of another. Note that the model
cons = β1 + β2 inc + β3 inc² + ε
does not violate Assumption MLR.4, even though X3 = inc²: though an exact function of X2 = inc, inc² is not an exact *linear* function of inc.
2. The same explanatory variable is included measured in different units.
3. A model is specified such as
log(cons) = β1 + β2 log(inc) + β3 log(inc²) + ε,
where X2 = log(inc) and X3 = log(inc²); since log(inc²) = 2 log(inc), one regressor is an exact linear function of the other.
4. One X is an exact linear function of two or more other Xs. The solution to the perfect collinearity here is simple: drop any one of the Xs involved.
5. The sample size is too small relative to the number of parameters being estimated.

THEOREM (UNBIASEDNESS OF OLS)
Under Assumptions MLR.1 through MLR.4,
$$E\left(\hat\beta_j\right) = \beta_j, \qquad j = 1, 2, \ldots, k,$$
for any values of the population parameters βj.
Assumptions 1 and 2 are straightforward: if we believe that the specified model is correct, then under the key Assumption 3 we can conclude that OLS is unbiased.
Note on interpretation: an estimate, as a fixed number obtained from a particular sample, usually is not equal to the population parameter and cannot itself be "unbiased". The procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples.

Including Irrelevant Variables in a Regression Model


Over-specifying a model: one (or more) of the Xs included in the model even though it has no
partial effect on Y. (That is, its population coefficient is zero.)
To illustrate the issue, suppose we specify the model as
$$Y = \beta_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \epsilon$$
Let
1. Assumptions 1 – 4 be satisfied;
2. X4 have no effect on Y after X2 and X3 have been controlled for: β4 = 0;
3. X4 may or may not be correlated with X2 or X3.
Thus,
E(Y|X2, X3, X4) = E(Y|X2, X3) = β1 + β2X2 + β3X3.
Since we do not know that β4 = 0, we are inclined to estimate the equation including X4:
$$\hat Y = \hat\beta_1 + \hat\beta_2X_2 + \hat\beta_3X_3 + \hat\beta_4X_4$$
We have included the irrelevant variable, X4, in our regression.
What is the effect of including X4 in our estimation when β4 = 0?
In terms of the unbiasedness of $\hat\beta_2$ and $\hat\beta_3$, there is no effect.
Unbiasedness means $E(\hat\beta_j) = \beta_j$ for any value of βj, including βj = 0.
Though $\hat\beta_4$ will essentially never equal 0 exactly, its average across many random samples will be zero.
In general, including one or more irrelevant variables in a multiple regression model (over-specifying the model) does not affect the unbiasedness of the OLS estimators.

This does not mean it is harmless to include irrelevant variables.


Including irrelevant variables can have undesirable effects on the variances of the OLS
estimators.
Omitted Variable Bias: The Simple Case
Suppose we omit a relevant variable, i.e., we under-specify the model.
Claim: this problem generally causes the OLS estimators to be biased.
Suppose the true population model has two explanatory variables and an error term:
$$Y = \beta_1 + \beta_2X_2 + \beta_3X_3 + \epsilon$$
(wage = β1 + β2 educ + β3 abil + ε)
Assume that this model satisfies Assumptions 1 – 4.
Let our primary interest be β2 (the partial effect of X2 on Y).
For example, Y = wage (or log of hourly wage), X2 = education, and X3 = ability.
An unbiased estimator of β2 requires a regression of Y on both X2 and X3.
However, due to our ignorance (or data unavailability) we estimate the model without X3 and perform a simple regression of Y on X2 only, obtaining the equation
$$Y = \beta_1 + \beta_2X_2 + \mu$$
(wage = β1 + β2 educ + μ)
where μ = β3X3 + ε (μ = β3 abil + ε).
The estimator of β2 from the simple regression of wage on educ is what we call $\tilde\beta_2$.
We derive the expected value of $\tilde\beta_2$ conditional on the sample values of X2 and X3. Now,
$$\tilde\beta_2 = \frac{\sum\left(X_{i2} - \bar X_2\right)Y_i}{\sum\left(X_{i2} - \bar X_2\right)^2} = \frac{\sum x_{i2}Y_i}{\sum x_{i2}^2}$$

For each observation i,
$$Y_i = \beta_1 + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i \qquad(\text{not } Y_i = \beta_1 + \beta_2X_{i2} + \mu_i).$$
Take the numerator:
$$\sum x_{i2}Y_i = \sum x_{i2}\left(\beta_1 + \beta_2X_{i2} + \beta_3X_{i3} + \epsilon_i\right)$$
$$= \beta_1\sum x_{i2} + \beta_2\sum x_{i2}X_{i2} + \beta_3\sum x_{i2}X_{i3} + \sum x_{i2}\epsilon_i$$
$$= \beta_2\sum x_{i2}^2 + \beta_3\sum x_{i2}X_{i3} + \sum x_{i2}\epsilon_i$$
(since $\sum x_{i2} = 0$ and $\sum x_{i2}X_{i2} = \sum x_{i2}^2$). If we divide this by the denominator,
$$\tilde\beta_2 = \frac{\sum x_{i2}Y_i}{\sum x_{i2}^2} = \beta_2 + \beta_3\frac{\sum x_{i2}X_{i3}}{\sum x_{i2}^2} + \frac{\sum x_{i2}\epsilon_i}{\sum x_{i2}^2},$$
and take the expectation conditional on the values of the independent variables, using E(ε) = 0, we obtain
$$E\left(\tilde\beta_2\right) = \beta_2 + \beta_3\frac{\sum x_{i2}X_{i3}}{\sum x_{i2}^2} + \frac{\sum x_{i2}E[\epsilon_i]}{\sum x_{i2}^2} = \beta_2 + \beta_3\frac{\sum x_{i2}X_{i3}}{\sum x_{i2}^2}$$
Thus, $E(\tilde\beta_2)$ does not generally equal β2: $\tilde\beta_2$ is biased for β2.

The ratio multiplying β3 is the slope coefficient from the regression of X3 on X2, which we can write as
$$X_3 = \tilde\delta_1 + \tilde\delta_2X_2 + v, \qquad \tilde\delta_2 = \frac{\sum x_{i2}X_{i3}}{\sum x_{i2}^2}$$
Conditioning on the sample values of both Xs implies that $\tilde\delta_2$ is not random. Therefore, we can write our result as
$$E\left(\tilde\beta_2\right) = \beta_2 + \beta_3\tilde\delta_2,$$
which implies that the bias in $\tilde\beta_2$ is $E(\tilde\beta_2) - \beta_2 = \beta_3\tilde\delta_2$.
This is often called the omitted variable bias.
$\tilde\beta_2$ is unbiased if
 β3 = 0, so that X3 does not appear in the true model, or
 $\tilde\delta_2$ = 0, even if β3 ≠ 0.
Since $\tilde\delta_2$ = (sample covariance between X2 and X3)/(sample variance of X2), $\tilde\delta_2$ = 0 if, and only if, X2 and X3 are uncorrelated in the sample.
Thus, if X2 and X3 are uncorrelated, then $\tilde\beta_2$ is unbiased.
If E(X3|X2) = E(X3), then $\tilde\beta_2$ is unbiased without conditioning on the Xi3; estimating β2 leaving X3 in the error term does not violate the zero conditional mean assumption for the error, once we adjust the intercept.
When X2 and X3 are correlated, $\tilde\delta_2$ has the same sign as the correlation between X2 and X3:
1. $\tilde\delta_2 > 0$ if corr(X2, X3) > 0, and
2. $\tilde\delta_2 < 0$ if corr(X2, X3) < 0.
Thus, the sign of bias($\tilde\beta_2$) = β3·$\tilde\delta_2$ depends on the signs of both β3 and $\tilde\delta_2$:

       | corr(X2, X3) > 0 | corr(X2, X3) < 0
β3 > 0 | Positive bias    | Negative bias
β3 < 0 | Negative bias    | Positive bias
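A small simulation (with illustrative parameter values chosen here, not taken from the text) makes the bias formula concrete: when X3 is omitted, corr(X2, X3) > 0 and β3 > 0, the short-regression slope is biased upward by roughly β3·δ̃2:

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, beta3 = 1.0, 0.5, 2.0      # assumed illustrative true parameters
n, reps = 200, 2000

slopes_short = []
for _ in range(reps):
    X2 = rng.normal(size=n)
    X3 = 0.8 * X2 + rng.normal(size=n)   # X3 correlated with X2; population delta_2 = 0.8
    Y = beta1 + beta2 * X2 + beta3 * X3 + rng.normal(size=n)
    x2 = X2 - X2.mean()
    slopes_short.append((x2 @ Y) / (x2 @ x2))   # simple regression of Y on X2 only

print(np.mean(slopes_short))             # approx. beta2 + beta3 * 0.8 = 2.1, not 0.5
```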
Conclusions
 What happens if we include variables in our specification that don’t belong?
 There is no effect on our parameter estimate, and OLS remains unbiased
 What if we exclude a variable from our specification that does belong?
 OLS will usually be biased
Variance of the OLS Estimators
 Now we know that the sampling distribution of our estimate is centered around the true
parameter
 Want to think about how spread out this distribution is
 Much easier to think about this variance under an additional assumption, so
 Assume Var(ε|X2, X3, ..., Xk) = σ² (homoskedasticity)
 Let X stand for (X2, X3, ..., Xk)
 Assuming that Var(ε|X) = σ²In also implies that Var(Y|X) = σ²In
 The 4 assumptions required for unbiasedness, plus homoskedasticity assumption are
known as the Gauss-Markov assumptions

=σ ( ′ )
Proof

= − −
We know from above that
= +( ′ )
Therefore

32
= [ +( ′ ) − ][ + ( ′ ) − ]
= [( ′ ) ][( ′ ) ]
Now
[( ′ ) ] = ( ′ )
Therefore
= [( ′ ) ( ′ ) ]
= [( ′ ) ( ′ ) ]
=( ′ ) [ ] ( ′ )
Now
[ ]=
Therefore
= [( ′ ) ( ′ ) ]
= [( ′ ) ( ′ ) ]
= ( ′ )

Given the Gauss-Markov assumptions,
$$\operatorname{Var}\left(\hat\beta_j\right) = \frac{\sigma^2}{\sum x_{ij}^2\left(1 - R_j^2\right)},$$
where $R_j^2$ is the R² from regressing Xj on all the other Xs.

Components of the OLS variances:
 The error variance: a larger σ² implies a larger variance for the OLS estimators.
 The total sample variation: a larger $\sum x_{ij}^2$ implies a smaller variance for the estimators.
 Linear relationships among the independent variables: a larger $R_j^2$ implies a larger variance for the estimators.

Misspecified Models
Consider the model
$$\tilde Y = \tilde\beta_1 + \tilde\beta_2X_2$$
when the model should actually be
$$\hat Y = \hat\beta_1 + \hat\beta_2X_2 + \hat\beta_3X_3,$$
so that
$$\operatorname{Var}\left(\tilde\beta_2\right) = \frac{\sigma^2}{\sum x_{i2}^2} \quad\text{and}\quad \operatorname{Var}\left(\hat\beta_2\right) = \frac{\sigma^2}{\sum x_{i2}^2\left(1 - r_{23}^2\right)}.$$
Thus, $\operatorname{Var}(\tilde\beta_2) < \operatorname{Var}(\hat\beta_2)$ unless X2 and X3 are uncorrelated, in which case the two are the same.
 The variance of the estimator from the misspecified model is smaller, but unless β3 = 0 the misspecified model is biased.
 As the sample size grows, the variance of each estimator shrinks to zero, making the variance difference less important.
Estimating the Error Variance
 We don't know what the error variance, σ², is, because we don't observe the errors, εi.
 What we observe are the residuals, $\hat\epsilon_i$.
 We can use the residuals to form an estimate of the error variance:
$$\hat\sigma^2 = \frac{\sum\hat\epsilon_i^2}{n - k} = \frac{RSS}{n - k}$$
Thus
$$\widehat{\operatorname{Var}}\left(\hat\beta_j\right) = \frac{\hat\sigma^2}{\sum x_{ij}^2\left(1 - R_j^2\right)}, \qquad SE\left(\hat\beta_j\right) = \sqrt{\widehat{\operatorname{Var}}\left(\hat\beta_j\right)}$$
The standard error formula is not a valid estimator of $\sqrt{\operatorname{Var}(\hat\beta_j)}$ if the errors are heteroskedastic.
Heteroskedasticity does not cause bias in the $\hat\beta_j$, but it leads to bias in the usual formula for $\operatorname{Var}(\hat\beta_j)$ and invalidates the standard errors.
The Gauss-Markov Theorem
 Given our 5 Gauss-Markov assumptions it can be shown that OLS is "BLUE" (Best Linear Unbiased Estimator).
 Thus, if the assumptions hold, use OLS.
The proof is beyond the scope of the course.
Here too, we do not know σ², thus we need to estimate it; its unbiased estimator is given by
$$\hat\sigma^2 = \frac{e'e}{n - k} = \frac{RSS}{n - k}$$
and the estimated variance of $\hat\beta$ is given as
$$\widehat{\operatorname{Var}}\left(\hat\beta\right) = \hat\sigma^2\left(X'X\right)^{-1}$$
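A short sketch, continuing the salary example, estimates σ² and the standard errors from $\widehat{\operatorname{Var}}(\hat\beta) = \hat\sigma^2(X'X)^{-1}$; the values should match those reported in parentheses in the regression equation shown further below:

```python
import numpy as np

Y  = np.array([30.0, 20.0, 36.0, 24.0, 40.0])
X2 = np.array([4.0, 3.0, 6.0, 4.0, 8.0])
X3 = np.array([10.0, 8.0, 11.0, 9.0, 12.0])
X  = np.column_stack([np.ones(5), X2, X3])

n, k = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat

sigma2_hat = (resid @ resid) / (n - k)            # RSS / (n - k) = 1.5 / 2 = 0.75
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))       # standard errors of beta_hat
print(sigma2_hat)
print(se)                                         # approx. [5.53, 0.68, 0.87]
```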

Hypothesis testing and interval estimation
For both hypothesis testing and interval estimation we need the assumption of normality of the error terms, i.e.,
$$\epsilon \sim N\left[0, \sigma^2I\right]$$
Since, in $Y = X\beta + \epsilon$, Y is a linear combination of the vector ε and the X's are fixed, it follows that
$$Y \sim N\left[X\beta, \sigma^2I\right],$$
and since
$$\hat\beta = (X'X)^{-1}X'Y$$
is a linear combination of Y, it follows that
a) $\hat\beta \sim N\left[\beta, \sigma^2(X'X)^{-1}\right]$,
b) $\dfrac{RSS}{\sigma^2} = \dfrac{e'e}{\sigma^2} \sim \chi^2_{n-k}$, and
c) $\hat\beta$ and e are uncorrelated. Thus, under the assumption of normality of the error terms, the estimators and the residuals are uncorrelated.
These three results are intuitive and follow our discussions in statistics and the SLRM. The third result is important in terms of obtaining notation and definitional equations which will be useful in what follows.
From (a) above it follows that
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2\left[(X'X)^{-1}\right]_{jj}}} \sim N[0, 1].$$
Therefore, replacing σ² by its estimate $\hat\sigma^2$ and using (b) and (c),
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2\left[(X'X)^{-1}\right]_{jj}}} = \frac{\hat\beta_j - \beta_j}{SE\left(\hat\beta_j\right)} \sim t_{n-k}.$$
This last result is the important one for the purpose of hypothesis testing. E.g., in our example we had
$$\hat Y = \underset{(5.53)}{-23.75} - \underset{(0.69)}{0.25}X_2 + \underset{(0.87)}{5.5}X_3, \qquad R^2 = 0.99$$
The test of significance for each of these parameters is given as
$$H_0: \beta_j = 0; \quad H_1: \beta_j \ne 0; \quad \forall\, j = 1, 2, 3.$$
Here
1. $\hat\beta_1$, $\hat\beta_2$ and $\hat\beta_3$ are normal with means $\beta_1$, $\beta_2$ and $\beta_3$, and
2. denoting the correlation coefficient between X2 and X3 by r23,
$$\operatorname{Var}\left(\hat\beta_2\right) = \frac{\sigma^2}{\sum x_{i2}^2\left(1 - r_{23}^2\right)}$$
$$\operatorname{Var}\left(\hat\beta_3\right) = \frac{\sigma^2}{\sum x_{i3}^2\left(1 - r_{23}^2\right)}$$
$$\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right) = \frac{-r_{23}\,\sigma^2}{\left(1 - r_{23}^2\right)\sqrt{\sum x_{i2}^2\sum x_{i3}^2}}$$
$$\operatorname{Var}\left(\hat\beta_1\right) = \frac{\sigma^2}{n} + \bar X_2^2\operatorname{Var}\left(\hat\beta_2\right) + 2\bar X_2\bar X_3\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right) + \bar X_3^2\operatorname{Var}\left(\hat\beta_3\right)$$
$$\operatorname{Cov}\left(\hat\beta_1, \hat\beta_2\right) = -\left[\bar X_2\operatorname{Var}\left(\hat\beta_2\right) + \bar X_3\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right)\right]$$
$$\operatorname{Cov}\left(\hat\beta_1, \hat\beta_3\right) = -\left[\bar X_3\operatorname{Var}\left(\hat\beta_3\right) + \bar X_2\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right)\right]$$
Notes
1. Note that the higher the value of r23 (other things staying the same), the higher the variances of $\hat\beta_2$ and $\hat\beta_3$. In fact, if r23 is very high, we cannot estimate β2 and β3 with much precision.
2. Note that $\sum x_{i2}^2\left(1 - r_{23}^2\right)$ is the RSS from a regression of X2 on X3. Similarly, $\sum x_{i3}^2\left(1 - r_{23}^2\right)$ is the residual sum of squares from a regression of X3 on X2. We can now see the analogy with the case of simple regression: $\operatorname{Var}(\hat\beta_2) = \sigma^2/RSS_2$, where RSS2 is the residual sum of squares after regressing X2 on the other variable, that is, after removing the effect of the other variable. This result generalizes to the case of several explanatory variables; in that case RSS2 is the residual sum of squares from a regression of X2 on all the other Xs.
Analogous to the other results in the case of simple regression, we have the following results:
3. If RSS is the residual sum of squares, then RSS/σ² has a χ²-distribution with (n - 3) degrees of freedom. This result can be used to make confidence-interval statements about σ².
4. If $\hat\sigma^2$ = RSS/(n - 3), then $E[\hat\sigma^2]$ = σ², i.e., $\hat\sigma^2$ is an unbiased estimator of σ².
5. If we substitute $\hat\sigma^2$ for σ² in the variance formulas, we get the estimated variances and covariances. The square roots of the estimated variances are called the standard errors (to be denoted by SE). Then
$$\frac{\hat\beta_2 - \beta_2}{SE\left(\hat\beta_2\right)} \quad\text{and}\quad \frac{\hat\beta_3 - \beta_3}{SE\left(\hat\beta_3\right)}$$
each has a t-distribution with (n - 3) degrees of freedom.
6. In addition to results 3 to 5, which have counterparts in the case of simple regression, we have one extra item in the case of multiple regression: that of confidence regions and joint tests for parameters. We have the following result:
$$F = \frac{1}{2\hat\sigma^2}\left[\sum x_{i2}^2\left(\hat\beta_2 - \beta_2\right)^2 + 2\sum x_{i2}x_{i3}\left(\hat\beta_2 - \beta_2\right)\left(\hat\beta_3 - \beta_3\right) + \sum x_{i3}^2\left(\hat\beta_3 - \beta_3\right)^2\right]$$
has an F-distribution with degrees of freedom 2 and (n - 3). This result can be used to construct a confidence region for β2 and β3 together, and to test hypotheses about β2 and β3 jointly.
We shall state later results 1 to 6 for the general case of k explanatory variables. But first we will consider an illustrative example.
Example
The most commonly estimated production function is of the following form:
$$Q = A\,L^{\beta_2}K^{\beta_3}$$
Now take the logarithm and augment it with an error term (the econometric model):
$$\ln Q = \beta_1 + \beta_2\ln L + \beta_3\ln K + \epsilon, \qquad\text{where } \beta_1 = \ln A.$$
So we can write this as
$$Y = \beta_1 + \beta_2X_2 + \beta_3X_3 + \epsilon$$
Y = ln output
X2 = ln labor input
X3 = ln capital input
Assume the Xs are nonstochastic. The following data (with lower-case letters denoting deviations from the means) are obtained from a sample of size n = 23 (23 individual firms):
$$\bar X_2 = 10, \qquad \bar X_3 = 5, \qquad \bar Y = 12$$
$$\sum x_2^2 = 12, \qquad \sum x_3^2 = 12, \qquad \sum x_2x_3 = 8$$
$$\sum x_2y = 10, \qquad \sum x_3y = 8, \qquad \sum y^2 = 10$$

(a) Compute $\hat\beta_1$, $\hat\beta_2$ and $\hat\beta_3$ and their standard errors. Present the regression equation.
The normal equations (in deviation form) are
$$12\hat\beta_2 + 8\hat\beta_3 = 10$$
$$8\hat\beta_2 + 12\hat\beta_3 = 8$$
Solve the first equation for $\hat\beta_2$ and substitute into the second equation to get
$$\hat\beta_2 = \frac{10 - 8\hat\beta_3}{12}$$
$$8\left(\frac{10 - 8\hat\beta_3}{12}\right) + 12\hat\beta_3 = 8$$
$$\frac{80 - 64\hat\beta_3}{12} + 12\hat\beta_3 = 8 \;\Rightarrow\; 80\hat\beta_3 = 16$$
$$\hat\beta_3 = \frac{16}{80} = 0.2$$
Then
$$12\hat\beta_2 + 8\times 0.2 = 10$$
$$12\hat\beta_2 = 8.4$$
$$\hat\beta_2 = \frac{8.4}{12} = 0.7$$
$$\hat\beta_1 = \bar Y - \hat\beta_2\bar X_2 - \hat\beta_3\bar X_3 = 12 - 0.7\times 10 - 0.2\times 5 = 4$$
$$R_{y.23}^2 = \frac{\hat\beta_2\sum x_2y + \hat\beta_3\sum x_3y}{\sum y^2} = \frac{0.7\times 10 + 0.2\times 8}{10} = 0.86$$
The residual sum of squares is $RSS = \sum y^2\left(1 - R_{y.23}^2\right) = 10(1 - 0.86) = 1.4$. Hence,
$$\hat\sigma^2 = \frac{RSS}{n - 3} = \frac{1.4}{20} = 0.07$$
$$r_{23}^2 = \frac{\left(\sum x_2x_3\right)^2}{\sum x_2^2\sum x_3^2} = \frac{64}{144} = \frac{4}{9}$$
Hence we have
$$\sum x_2^2\left(1 - r_{23}^2\right) = 12\left(1 - \frac{4}{9}\right) = 12\cdot\frac{5}{9} = \frac{20}{3} = \sum x_3^2\left(1 - r_{23}^2\right)$$
Hence
$$\operatorname{Var}\left(\hat\beta_2\right) = \frac{3\sigma^2}{20} = \operatorname{Var}\left(\hat\beta_3\right)$$
$$\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right) = \frac{-r_{23}\,\sigma^2}{\left(1 - r_{23}^2\right)\sqrt{\sum x_2^2\sum x_3^2}} = -\frac{\sigma^2}{10}$$
$$\operatorname{Var}\left(\hat\beta_1\right) = \sigma^2\left[\frac{1}{23} + 10^2\left(\frac{3}{20}\right) - 2(10)(5)\left(\frac{1}{10}\right) + 5^2\left(\frac{3}{20}\right)\right] = 8.7935\,\sigma^2$$
Substituting the estimate of σ², which is 0.07, in these expressions and taking the square roots, we get
$$SE\left(\hat\beta_2\right) = SE\left(\hat\beta_3\right) = \sqrt{\frac{3\times 0.07}{20}} = \sqrt{0.0105} = 0.102$$
$$SE\left(\hat\beta_1\right) = \sqrt{8.7935\times 0.07} = 0.78$$
Thus, the regression equation is
$$\hat Y = \underset{(0.78)}{4} + \underset{(0.102)}{0.7}X_2 + \underset{(0.102)}{0.2}X_3, \qquad R^2 = 0.86$$
Figures in parentheses are standard errors.
(b) Find the 95% confidence intervals for β1, β2, and β3, and test the hypotheses β2 = 1 and β3 = 0 separately at the 5% significance level.
Using the t-distribution with 20 d.f., we get the 95% confidence intervals for β1, β2, and β3 as
$$\hat\beta_1 \pm 2.086\,SE\left(\hat\beta_1\right) = 4 \pm 1.63 = (2.37, 5.63)$$
$$\hat\beta_2 \pm 2.086\,SE\left(\hat\beta_2\right) = 0.7 \pm 0.21 = (0.49, 0.91)$$
$$\hat\beta_3 \pm 2.086\,SE\left(\hat\beta_3\right) = 0.2 \pm 0.21 = (-0.01, 0.41)$$
The hypothesis β2 = 1 will be rejected at the 5% significance level, since β2 = 1 is outside the 95% confidence interval for β2. The hypothesis β3 = 0 will not be rejected, because β3 = 0 is a point inside the 95% confidence interval for β3.
Assignment: Using the χ² distribution with 20 d.f., find the 95% confidence interval for σ².
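The calculations in this example can be reproduced directly from the summary statistics; the following sketch does so in numpy (scipy is used only for the t critical value, approximately 2.086 with 20 d.f.):

```python
import numpy as np
from scipy import stats

# Summary statistics from the production-function example (deviation form), n = 23
n = 23
S = np.array([[12.0, 8.0],     # [[sum x2^2, sum x2x3],
              [8.0, 12.0]])    #  [sum x2x3, sum x3^2]]
Sy = np.array([10.0, 8.0])     # [sum x2 y, sum x3 y]
sum_y2 = 10.0
ybar, x2bar, x3bar = 12.0, 10.0, 5.0

b = np.linalg.solve(S, Sy)                       # slopes: [0.7, 0.2]
b1 = ybar - b @ np.array([x2bar, x3bar])         # intercept: 4
rss = sum_y2 - b @ Sy                            # 1.4
sigma2 = rss / (n - 3)                           # 0.07
se_slopes = np.sqrt(sigma2 * np.diag(np.linalg.inv(S)))   # approx. [0.102, 0.102]

t = stats.t.ppf(0.975, df=n - 3)                 # approx. 2.086
print(b1, b, se_slopes)
print([(bj - t * s, bj + t * s) for bj, s in zip(b, se_slopes)])   # 95% CIs for the slopes
```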
Tests of linear restrictions of coefficients
Sometimes we may be interested in testing a linear function of the coefficients. For example, the most commonly estimated production function is of the following form:
$$Q = A\,L^{\beta_2}K^{\beta_3}$$
The returns to scale of this function are given by β2 + β3, and if this sum equals unity (1) we say we have constant returns to scale (the production function above is of the Cobb-Douglas form). Now take the logarithm of this equation to obtain
$$\ln Q = \beta_1 + \beta_2\ln L + \beta_3\ln K + \epsilon$$
Suppose that we have data on output, labor and capital of a firm over time, or on these variables for different firms at a point in time, and want to see whether the technology of the firm(s) exhibits constant returns to scale. All we have to do is estimate this equation using multiple regression and check whether the sum of the two slope parameters in the log-linear equation equals unity (H0: β2 + β3 = 1), using a t-test.
The calculated t is obtained as
$$t = \frac{\hat\beta_2 + \hat\beta_3 - 1}{SE\left(\hat\beta_2 + \hat\beta_3\right)}, \qquad SE\left(\hat\beta_2 + \hat\beta_3\right) = \sqrt{\operatorname{Var}\left(\hat\beta_2\right) + \operatorname{Var}\left(\hat\beta_3\right) + 2\operatorname{Cov}\left(\hat\beta_2, \hat\beta_3\right)}$$
For our example,
$$|t| = \frac{|0.7 + 0.2 - 1|}{\sqrt{0.0105 + 0.0105 + 2(-0.007)}} = \frac{0.1}{\sqrt{0.007}} = \frac{0.1}{0.0837} = 1.2$$

But there is a much more general formulation for restrictions using the F-test. Recall that if
$$X \sim N\left(\mu, \sigma^2\right),$$
then
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
is called a standard normal variable. If we square a standard normal variable we obtain a chi-squared variable with one degree of freedom, written as
$$Z^2 \sim \chi^2_1.$$
If we sum n such independent variables we get
$$\sum_{i=1}^n Z_i^2 \sim \chi^2_n.$$
Now, if we have two independent χ² variables with p and q degrees of freedom, say
$$W_1 \sim \chi^2_p \quad\text{and}\quad W_2 \sim \chi^2_q,$$
then
$$\frac{W_1/p}{W_2/q} \sim F_{p,q}.$$
The F distribution is handy when we impose restrictions on our regressions and want to see whether our restrictions are statistically meaningful.
Before we go there, however, we look at the analysis of variance for the case of the SLR.

Source of variation | Sum of squares                                   | Degrees of freedom | Mean square
X                   | ESS = $\hat\beta_2^2\sum x_i^2$                  | 1                  | ESS/1
Residual e          | RSS = $\sum y_i^2 - \hat\beta_2^2\sum x_i^2$     | n - 2              | RSS/(n - 2)
Total               | TSS = $\sum y_i^2$                               | n - 1              |

Assignment: Use the results from the example we worked out in class to fill out the table above.
In presentations the TSS is broken down into ESS and RSS in order to test the significance of the ESS. In the case of simple linear regression this is equivalent to the test of significance of β2. For our class example (worked out below) we have

Source of variation | Sum of squares | Degrees of freedom | Mean square
X                   | 15.75          | 1                  | 15.75
Residual            | 14.65          | 8                  | 1.83
Total               | 30.4           | 9                  |
Under the assumptions of the model,
$$\frac{RSS}{\sigma^2} \sim \chi^2_{(n-2)}, \quad\text{and, under } H_0: \beta_2 = 0, \quad \frac{ESS}{\sigma^2} \sim \chi^2_{(1)}.$$
Thus
$$F = \frac{\left(ESS/\sigma^2\right)/1}{\left(RSS/\sigma^2\right)/(n-2)} = \frac{ESS/1}{RSS/(n-2)} \sim F_{1,(n-2)}$$

Suppose we have the following results from a data set of 10 observations:
$$\sum x_i^2 = 28; \qquad \sum y_i^2 = 30.4; \qquad \sum x_iy_i = 21$$
Then
$$\hat\beta_2 = \frac{\sum x_iy_i}{\sum x_i^2} = \frac{21}{28} = 0.75$$
$$ESS = \hat\beta_2\sum x_iy_i = 0.75(21) = 15.75$$
$$RSS = \sum y_i^2 - \hat\beta_2\sum x_iy_i = 30.4 - 15.75 = 14.65$$
$$\hat\sigma^2 = \frac{RSS}{n - 2} = \frac{14.65}{8} = 1.83$$
$$\operatorname{Var}\left(\hat\beta_2\right) = \frac{\hat\sigma^2}{\sum x_i^2} = \frac{1.83}{28} = 0.065 \;\rightarrow\; SE\left(\hat\beta_2\right) = \sqrt{0.065} = 0.256$$
Therefore
$$t = \frac{\hat\beta_2}{SE\left(\hat\beta_2\right)} = \frac{0.75}{0.256} = 2.93$$
$$F = \frac{ESS/1}{RSS/(n-2)} = \frac{15.75}{1.83} = 8.6$$
Note that $t^2 = (2.93)^2 = 8.6 = F$; this is not an accident.


I would also like you to note the following relationships:
$$ESS = \hat\beta_2^2\sum x_i^2 = \frac{\left(\sum x_iy_i\right)^2}{\left(\sum x_i^2\right)^2}\sum x_i^2 = \frac{\left(\sum x_iy_i\right)^2}{\sum x_i^2} = \frac{\left(\sum x_iy_i\right)^2}{\sum x_i^2\sum y_i^2}\sum y_i^2 = r^2\sum y_i^2,$$
thus
$$RSS = \sum y_i^2 - r^2\sum y_i^2 = \left(1 - r^2\right)\sum y_i^2.$$
Hence
$$F = \frac{ESS/1}{RSS/(n-2)} = \frac{r^2\sum y_i^2}{\left(1 - r^2\right)\sum y_i^2/(n-2)} = \frac{r^2(n-2)}{1 - r^2}$$
And since
$$F = \frac{r^2(n-2)}{1 - r^2} = t^2 \;\rightarrow\; t^2\left(1 - r^2\right) = r^2(n-2)$$
$$t^2 - t^2r^2 = r^2(n-2)$$
$$r^2(n-2) + t^2r^2 = t^2$$
$$r^2 = \frac{t^2}{(n-2) + t^2}$$

In MLR models the F-statistic has an even more powerful interpretation.

To use the F-test in the multiple regression setting, take the model
$$Y_i = \beta_1 + \beta_2X_{i2} + \beta_3X_{i3} + \cdots + \beta_kX_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n$$
Estimate it to obtain the $\hat\beta_j$, j = 1, ..., k, obtain the RSS from this model, and call it the unrestricted RSS (URSS). Now, impose as many restrictions as you would like. For example, say
$$\beta_j = 0 \quad\forall\, j = 2, \ldots, k,$$
that is, all slope parameters are zero. This is called the overall test of the regression. What it says is that none of the explanatory variables has any power to explain the variation in Y. If we reject this, we know that some of the explanatory variables matter, but we cannot yet identify which ones.

In this case:
$$F = \frac{(RRSS - URSS)/r}{URSS/(n-k)} \sim F_{r,(n-k)}$$
where URSS = unrestricted residual sum of squares;
RRSS = restricted residual sum of squares, obtained by imposing the restrictions of the hypothesis;
r = number of restrictions imposed by the hypothesis.
Example: consider the hypothesis β2 = 1, β3 = 0 for the earlier model
$$\hat Y = \underset{(0.78)}{4} + \underset{(0.102)}{0.7}X_2 + \underset{(0.102)}{0.2}X_3, \qquad R^2 = 0.86$$
Recall that URSS, which is the RSS from the multiple regression, is 1.4.
What we do not know is the RRSS; to obtain it we need to minimize
$$\sum\left(Y_i - \beta_1 - 1\cdot X_{i2} - 0\cdot X_{i3}\right)^2$$
Since both β2 and β3 are specified, only β1 remains to be estimated, and we get
$$\frac{\partial}{\partial\beta_1}\sum\left(Y_i - \beta_1 - X_{i2}\right)^2 = 0 \;\rightarrow\; \sum\left(Y_i - \beta_1 - X_{i2}\right)(-1) = 0 \;\rightarrow\; \sum Y_i - n\beta_1 - \sum X_{i2} = 0$$
$$\hat\beta_1 = \bar Y - \bar X_2 = 12 - 10 = 2$$
Thus, the RRSS is given by
$$RRSS = \sum\left(Y_i - \hat\beta_1 - X_{i2}\right)^2 = \sum\left(Y_i - \left(\bar Y - \bar X_2\right) - X_{i2}\right)^2 = \sum\left(y_i - x_{i2}\right)^2 = \sum y_i^2 + \sum x_{i2}^2 - 2\sum x_{i2}y_i$$
For our example above,
$$RRSS = 10 + 12 - 2(10) = 2$$
Since we have two restrictions, r = 2, n = 23 and k = 3, and we have
$$F = \frac{(RRSS - URSS)/r}{URSS/(n-k)} = \frac{(2 - 1.4)/2}{1.4/20} = \frac{0.3}{0.07} = 4.3$$
This is then compared with the tabulated $F_{2,20} = 3.49$.

Since Fc > Ft we reject the hypothesis!
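A minimal sketch of this restriction test, working from the residual sums of squares computed above, with scipy supplying the F critical value and p-value:

```python
from scipy import stats

# F-test of the joint restriction beta2 = 1, beta3 = 0 in the production-function example
urss, rrss = 1.4, 2.0
r, n, k = 2, 23, 3

F = ((rrss - urss) / r) / (urss / (n - k))      # approx. 4.29
F_crit = stats.f.ppf(0.95, dfn=r, dfd=n - k)    # approx. 3.49
p_value = stats.f.sf(F, dfn=r, dfd=n - k)
print(F, F_crit, p_value)                       # reject H0 at the 5% level since F > F_crit
```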
In the special case where the hypothesis is
$$H_0: \beta_2 = \beta_3 = \cdots = \beta_k = 0,$$
we have URSS = $\sum y_i^2\left(1 - R^2\right)$ and RRSS = $\sum y_i^2$. Hence the test is given by
$$F = \frac{(RRSS - URSS)/(k-1)}{URSS/(n-k)} = \frac{\left[\sum y_i^2 - \sum y_i^2\left(1 - R^2\right)\right]/(k-1)}{\sum y_i^2\left(1 - R^2\right)/(n-k)} = \frac{R^2/(k-1)}{\left(1 - R^2\right)/(n-k)} = \frac{R^2\,(n-k)}{\left(1 - R^2\right)(k-1)},$$
which has an F-distribution with degrees of freedom (k - 1) and (n - k).


This tests the hypothesis that none of the X's influence Y. Rejection of this hypothesis leaves
us with the question: Which of the X's are useful in explaining Y?
Degrees of freedom and adjusted R² ($\bar R^2$)
Given n observations on one dependent and two explanatory variables, we can estimate the parameters of a linear regression
$$Y_i = \hat\beta_1 + \hat\beta_2X_{i2} + \hat\beta_3X_{i3} + \hat\epsilon_i$$
We know that the residuals from the normal equations must satisfy the following restrictions:
$$\sum\hat\epsilon_i = 0; \qquad \sum\hat\epsilon_iX_{i2} = 0; \qquad \sum\hat\epsilon_iX_{i3} = 0$$
In essence what we are saying here is that only n - 3 residuals are free to vary because, given any n - 3 residuals, the remaining three residuals can be obtained by solving the above three restrictions. This fact is expressed by saying that there are n - 3 degrees of freedom.
The estimate of the variance of the model is
$$\hat\sigma^2 = \frac{\sum\hat\epsilon_i^2}{n - 3} = \frac{RSS}{n - 3}$$

Note that
1. as the number of explanatory variables increases, the RSS goes down, and
2. as the number of explanatory variables increases, the degrees of freedom go down.
Thus, what happens to $\hat\sigma^2$ depends on the proportionate decrease in the value of the numerator and the denominator. That is, there will be a point, at least theoretically, where $\hat\sigma^2$ will actually stop declining.
The effect of the degrees of freedom will also be felt on R², and thus should be taken care of by making some form of adjustment. This is why we report the adjusted R² when we present a multiple regression; it takes care of the effect of the degrees of freedom. We know that
$$R_{y.234}^2 \ge R_{y.23}^2 \ge r_{y2}^2$$
Now, we know that the coefficient of determination in the case of the SLR is
$$r_{y2}^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum\hat\epsilon_i^2}{\sum y_i^2}$$
44
When we have two explanatory variables,
$$R_{y.23}^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum\hat\epsilon_{i(y.23)}^2}{\sum y_i^2};$$
when we have three explanatory variables,
$$R_{y.234}^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum\hat\epsilon_{i(y.234)}^2}{\sum y_i^2};$$
etc., where, of course, $\sum\hat\epsilon_i^2$ is the minimum sum of squared residuals in each case. Therefore, it follows that
$$\sum\hat\epsilon_{i(y.234)}^2 \le \sum\hat\epsilon_{i(y.23)}^2 \le \sum\hat\epsilon_{i(y.2)}^2$$

Thus R², the way we defined it, keeps on increasing (toward 1, at least theoretically) as we add explanatory variables, and so does not take the degrees-of-freedom problem into account. The adjusted R² takes care of this; it is defined as R² adjusted for degrees of freedom:
$$\bar R^2 = 1 - \frac{\sum\hat\epsilon_i^2/(n-k)}{\sum y_i^2/(n-1)}$$
$$1 - \bar R^2 = \frac{\sum\hat\epsilon_i^2/(n-k)}{\sum y_i^2/(n-1)}$$
$$1 - \bar R^2 = \frac{(n-1)\sum\hat\epsilon_i^2}{(n-k)\sum y_i^2} = \frac{(n-1)}{(n-k)}\left(1 - R^2\right)$$
Therefore
$$1 - \bar R_{y.23\ldots k}^2 = \frac{(n-1)}{(n-k)}\left(1 - R_{y.23\ldots k}^2\right)$$
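A tiny helper makes the adjustment explicit; applied to the salary example (n = 5, k = 3, R² about 0.994), it illustrates how the correction pulls R² down:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (n-1)/(n-k) * (1 - R^2), with k counting the intercept."""
    return 1 - (n - 1) / (n - k) * (1 - r2)

print(adjusted_r2(0.994, n=5, k=3))   # approx. 0.988
```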

Example:
Suppose we modelled the demand for a certain good as a function of its own price and income in the following two ways:
$$Q = \beta_1 + \beta_2P + \beta_3Y + \epsilon \qquad (1)$$
where Q is the quantity demanded of the good, P is its price and Y is disposable income. Note that in this formulation
$$\frac{\partial Q}{\partial P} = \beta_2; \qquad \frac{\partial Q}{\partial Y} = \beta_3$$
are the marginal effects of quantity demanded with respect to P and Y, respectively. Suppose we also have (adding an interaction term)
$$Q = \beta_1 + \beta_2P + \beta_3Y + \beta_4PY + \epsilon \qquad (2)$$
Here, however, the marginal effects of quantity demanded with respect to P and Y are given by
$$\frac{\partial Q}{\partial P} = \beta_2 + \beta_4Y; \qquad \frac{\partial Q}{\partial Y} = \beta_3 + \beta_4P$$
Now, suppose we regressed these two equations on a given data set and obtained
$$R_{(1)}^2 < R_{(2)}^2$$
but
$$\bar R_{(1)}^2 > \bar R_{(2)}^2.$$
Given these results, we would prefer equation 1 to equation 2, basing our judgment on the adjusted R². It can be shown that for such a case the t ratio for $\hat\beta_4$ will be less than unity. That is,
$$\bar R_{(1)}^2 > \bar R_{(2)}^2 \;\leftrightarrow\; |t_{\hat\beta_4}| < 1.$$
Similarly,
$$\bar R_{(2)}^2 > \bar R_{(1)}^2 \;\leftrightarrow\; |t_{\hat\beta_4}| > 1.$$
Note: selection of models based on $\bar R^2$ should be made sparingly and very carefully. This criterion should be used only when the dependent variable is the same (not transformed in any way).
Example: suppose we estimated two models as follows:
$$Y = \beta_1 + \beta_2X + \epsilon \qquad (1)$$
$$\ln Y = \beta_1 + \beta_2X + \epsilon \qquad (2)$$
The same data set is used to estimate these two models, and we obtained R² of 0.618 and 0.705, respectively. The question is: can we choose equation 2? The answer is no, since the units of measurement of the dependent variable are not identical. To make them comparable, we have to
1. calculate $\widehat{\ln Y_i}$ for each observation in equation 2,
2. take its antilog as $e^{\widehat{\ln Y_i}}$, and
3. compute a quasi-R² as
$$\text{quasi-}R^2 = 1 - \frac{\sum\left(Y_i - e^{\widehat{\ln Y_i}}\right)^2}{\sum\left(Y_i - \bar Y\right)^2}$$
This quasi-R² is comparable to the R² obtained in the first equation.

Tests for Stability


When we estimate a multiple regression equation and use it for predictions at future points of
time we assume that the parameters are constant over the entire time period of estimation and
prediction. To test this hypothesis of parameter constancy (or stability) some tests have been
proposed. These tests can be described as:
1. Analysis-of-variance tests.
2. Predictive tests.
The Analysis-of-Variance Test
Suppose that we have two independent sets of data with sample sizes n1 and n2, respectively. The regression equations are
$$Y_i = \beta_{11} + \beta_{12}X_{i2} + \beta_{13}X_{i3} + \cdots + \beta_{1k}X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n_1$$
$$Y_i = \beta_{21} + \beta_{22}X_{i2} + \beta_{23}X_{i3} + \cdots + \beta_{2k}X_{ik} + \epsilon_i, \qquad i = 1, 2, \ldots, n_2$$
The first subscript denotes the data set and the second subscript denotes the variable. A test for stability of the parameters between the populations that generated the two data sets is a test of the hypothesis
$$H_0: \beta_{11} = \beta_{21};\ \beta_{12} = \beta_{22};\ \beta_{13} = \beta_{23};\ \ldots;\ \beta_{1k} = \beta_{2k}$$
If this hypothesis is true, we can estimate a single equation for the data set obtained by pooling the two data sets.
Here we can use the F-test based on URSS and RRSS. To get the unrestricted residual sum of squares we estimate the regression model for each of the data sets separately. Define
RSS1 = residual sum of squares for the first data set;
RSS2 = residual sum of squares for the second data set.
RSS1/σ² has a χ²-distribution with d.f. (n1 - k), and RSS2/σ² has a χ²-distribution with d.f. (n2 - k).
Since the two data sets are independent, (RSS1 + RSS2)/σ² has a χ²-distribution with d.f. (n1 + n2 - 2k). We will denote (RSS1 + RSS2) by URSS. The restricted residual sum of squares RRSS is obtained from the regression with the pooled data. (This imposes the restriction that the parameters are the same.) Thus RRSS/σ² has a χ²-distribution with d.f. = (n1 + n2) - k. Hence
$$F = \frac{(RRSS - URSS)/k}{URSS/(n_1 + n_2 - 2k)},$$
which has an F-distribution with degrees of freedom k and (n1 + n2 - 2k).


Example 1: Stability of the Demand for Food Function
We use US data on food consumption before and after the Second World War (1927-1941 and 1948-1962) and for the entire period. Suppose that we want to test the stability of the parameters in the demand function between the two periods. The regression results are the following (the dependent variable is ln(exp) in all columns; standard errors are given below the coefficient estimates, and σ̂ is the standard error of the regression):
        |              Equation 1              |              Equation 2
        | All obs.  | 1927-1941 | 1948-1962    | All obs.  | 1927-1941 | 1948-1962
β1      | 4.047     | 4.555     | 5.052        | 8.028802  | 4.056727  | 16.62054
        | (0.136)   | (0.201)   | (0.901)      | (1.798118)| (7.521799)| (27.24275)
β2      | -0.119    | -0.235    | -0.237       | -0.99621  | -0.12238  | -2.74278
        | (0.04)    | (0.053)   | (0.154)      | (0.397012)| (1.70352) | (5.899208)
β3      | 0.241     | 0.243     | 0.141        | -0.71835  | 0.368136  | -2.41304
        | (0.013)   | (0.023)   | (0.046)      | (0.432424)| (1.885168)| (6.01058)
β4      |           |           |              | 0.211176  | -0.02827  | 0.553157
        |           |           |              | (0.095131)| (0.426718)| (1.301887)
N       | 30        | 15        | 15           | 30        | 15        | 12
R²      | 0.9731    | 0.9066    | 0.8741       | 0.9774    | 0.9066    | 0.8424
RSS     | 0.002869  | 0.001151  | 0.000544     | 0.002412  | 0.001151  | 0.000535
F       | 488.32    | 58.24     | 41.67        | 374.54    | 35.6      | 25.95
d.f.    | 27        | 12        | 12           | 26        | 11        | 11
σ̂       | 0.01031   | 0.0098    | 0.00673      | 0.00963   | 0.01023   | 0.00698
The sum of the RSS from the two separate sub-period regressions gives us the URSS:
$$URSS = 0.001151 + 0.000544 = 0.001695, \qquad\text{with d.f.} = 12 + 12 = 24.$$
RRSS is the RSS from the regression on the pooled data:
$$RRSS = 0.002866, \qquad\text{with d.f.} = 27.$$
This regression from the pooled data imposes the restriction that the parameters are the same in the two periods. Hence
$$F = \frac{(RRSS - URSS)/k}{URSS/(n_1 + n_2 - 2k)} = \frac{(0.002866 - 0.001695)/3}{0.001695/24} = 5.53$$
From the F-tables with d.f. 3 and 24 we see that the 5% point is about 3.01 and the 1% point is about 4.72. Thus, even at the 1% level of significance, we reject the hypothesis of stability. Thus there is no case for pooling.
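A minimal sketch of this stability (Chow-type) test for equation 1, working from the residual sums of squares reported above (scipy supplies the critical values):

```python
from scipy import stats

# Stability test for equation 1: pooled regression vs. two sub-period regressions
rss_pooled = 0.002866          # RRSS from the pooled regression
rss_1, rss_2 = 0.001151, 0.000544
k = 3                          # parameters per regression (intercept + 2 slopes)
df_unrestricted = 12 + 12      # (n1 - k) + (n2 - k)

urss = rss_1 + rss_2
F = ((rss_pooled - urss) / k) / (urss / df_unrestricted)
print(F)                                              # approx. 5.5
print(stats.f.ppf([0.95, 0.99], k, df_unrestricted))  # approx. [3.01, 4.72] -> reject stability
```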
Note that, if we look at the individual coefficients, $\hat\beta_2$ is almost the same in the two sub-period regressions. Thus it appears that the price elasticity has been constant; it is the income elasticity that has changed between the two periods. Consider now equation 2. We now have
$$URSS = 0.001151 + 0.000535 = 0.001686 \quad\text{with d.f.} = 11 + 11 = 22$$
$$RRSS = 0.002412 \quad\text{with d.f.} = 26$$
Hence
$$F = \frac{(RRSS - URSS)/4}{URSS/(n_1 + n_2 - 2k)} = \frac{(0.002412 - 0.001686)/4}{URSS/(n_1 + n_2 - 2k)} = 2.79$$
From the F-tables with d.f. 4 and 22 we see that the 5% point is about 2.82. So, at the 5% significance level, we do not reject the hypothesis of stability.
One can ask how this result came about. If we look at the individual coefficients of equation 2 for the two periods 1927-1941 and 1948-1962 separately, we notice that the t-ratios are very small; that is, the standard errors are very high relative to the magnitudes of the coefficient estimates. Thus the observed differences in the coefficient estimates between the two periods would not be statistically significant. When we look at the regression for the pooled data, we notice that the coefficient of the interaction term is significant, but the estimates for the two periods separately, as well as the rejection of the hypothesis of stability for equation 1, cast doubt on the desirability of including the interaction term.

The data used in this example are the following:

Year | q     | p     | y
1927 | 88.9  | 91.7  | 57.7
1928 | 88.9  | 92    | 59.3
1929 | 89.1  | 93.1  | 62
1930 | 88.7  | 90.9  | 56.3
1931 | 88    | 82.3  | 52.7
1932 | 85.9  | 76.3  | 44.4
1933 | 86    | 78.3  | 43.8
1934 | 87.1  | 84.3  | 47.8
1935 | 85.4  | 88.1  | 52.1
1936 | 88.5  | 88    | 58
1937 | 88.4  | 88.4  | 59.8
1938 | 88.6  | 83.5  | 55.9
1939 | 91.7  | 82.4  | 60.3
1940 | 93.3  | 83    | 64.1
1941 | 95.1  | 86.2  | 73.7
1948 | 96.7  | 105.3 | 82.1
1949 | 96.7  | 102   | 83.1
1950 | 98    | 102.4 | 88.6
1951 | 96.1  | 105.4 | 88.3
1952 | 98.1  | 105   | 89.1
1953 | 99.1  | 102.6 | 92.1
1954 | 99.1  | 101.9 | 91.7
1955 | 99.8  | 100.8 | 96.5
1956 | 101.5 | 100   | 99.8
1957 | 99.9  | 99.8  | 99.9
1958 | 99.1  | 101.2 | 98.4
1959 | 101   | 98.8  | 101.8
1960 | 100.7 | 98.4  | 101.8
1961 | 100.8 | 98.8  | 103.1
1962 | 101   | 98.4  | 105.5
