You are on page 1of 34

Introductory Econometrics

Statistical Properties of the OLS Estimator, Interpretation of


OLS Estimates and Effects of Rescaling

Monash Econometrics and Business Statistics

2019

1 / 34
Recap

I The multiple regression model can be written in matrix form as

y = X β + u
n×1 n×(k+1) (k+1)×1 n×1

I The OLS procedure finds a linear combination of X that is closest


to the vector y, i.e. the length of its error vector (OLS residuals) is
the shortest
I This implies that the OLS residual vector is perpendicular to all
columns of X, i.e.,
X0 û = 0
which leads to the famous OLS formula

βb = (X0 X)−1 X0 y

2 / 34
I A consequence of orthogonality of residuals and columns of X is that
n
X n
X n
X
(yi − ȳ )2 = (ŷi − ȳ )2 + ûi2
i=1 i=1 i=1

or
SST = SSE + SSR
I This leads to the definition of the coefficient of determination R 2 ,
which is a measure of goodness of fit

R 2 = SSE/SST = 1 − SSR/SST

3 / 34
I OLS formula gives us fitted values that are closest to the actual
values in the sense that the length of the residual vector is the
smallest possible.
I But why should this be a good estimate of the unknown parameters
of the conditional expectation function?
I We ask if this formula produces the best information we can get
from our data about the unknown parameters β

4 / 34
Lecture Outline
I Interpretation of the OLS estimates in multiple regression (textbook
reference 3-2a to 3-2e)
I Properties of good estimators
I Properties of the OLS estimator βb
1. OLS is unbiased: The expected value of βb (Textbook reference
3-3, and Appendix E-2)
2. The Variance of βb (Textbook reference 3-4, and Appendix E-2)
3. OLS is BLUE (Gauss-Markov Theorem): The Efficiency of βb
(Textbook reference 3-5, and Appendix E-2)
I Units of measurement: do the results qualitatively change if we
change the units of measurement? (textbook reference 2-4a, 6-1
(exclude 6-1a))

5 / 34
Interpretation of OLS estimates
Example: The causal effect of education on wage

I Consider again the wage equation written as:

wage = β0 + β1 educ + β2 IQ + u

where IQ is IQ score (in the population, it has a mean of 100 and sd


= 15).
I Primarily interested in β1 , because we want to know the value that
education adds to a person’s wage.
I Without IQ in the equation, the coefficient of educ will show how
strongly wage and educ are correlated, but both could be caused by
a person’s ability.
I By explicitly including IQ in the equation, we obtain a more
persuasive estimate of the causal effect of education provided that
IQ is a good proxy for intelligence.

6 / 34
Interpretation of OLS estimates
Example: The causal effect of education on wage

I If we only estimate a regression of wage on education, we cannot be


sure if we are measuring the effect of education, or if education is
acting as a proxy for smartness. This is important, because if the
education system does not add any value other than separating
smart people from not so smart, the society can achieve that much
cheaper by national IQ tests!
7 / 34
Interpretation of OLS estimates
Example: The causal effect of education on wage

I The coefficient of educ now shows that for two people with the
same IQ score, the one with 1 more year of education is expected to
earn $42 more.

8 / 34
Interpretation of OLS estimates

I Consider k = 2 for simplicity


I The conditional expectation of y given x1 and x2 (also know as the
population regression function) is

E (y | x1 , x2 ) = β0 + β1 x1 + β2 x2

I The estimated regression (also known as the sample regression


function) is
ŷ = β̂0 + β̂1 x1 + β̂2 x2 .
I The formula
∆ŷ = β̂1 ∆x1 + β̂2 ∆x2
allows us to compute how predicted y changes when x1 and x2
change by any amount.

9 / 34
I What if we increase x1 by 1 unit and hold x2 fixed, that is, ∆x1 = 1
and ∆x2 = 0?

∆ŷ = β̂1 if ∆x1 = 1 and ∆x2 = 0

I In other words, β̂1 is the change in predicted y when x1 increases by


1 unit while x2 is held fixed.
I We also refer to β̂1 as the estimate of the partial effect of x1 on y
holding x2 constant
I Yet another legitimate interpretation is that β̂1 estimates the effect
of x1 on y after the influence of x2 has been removed (or has been
controlled for)

10 / 34
I Similarly,
∆ŷ = β̂2 if ∆x1 = 0 and ∆x2 = 1
I Let’s go back to regression output and interpret the parameters.

wage
[ = −128.89 + 42.06 educ + 5.14 IQ
n = 935, R 2 = .134

I 42.06 shows that for two people with the same IQ, the one with one
more year of education is predicted to earn $42.06 more in monthly
wages.
I Or: Every extra year of education increases the predicted wage by
$42.06, keeping IQ constant (or “after controlling for IQ”, or ”after
removing the effect of IQ”, or ”all else constant”, or ”all else
equal”, or ”ceteris paribus”)

11 / 34
Effects of rescaling

The upshot: changing units of measurement does not create any extra
information, so it only changes OLS results in predictable and
non-substantive ways.
I Scaling x → cx: Any change of unit of measurement that involves
multiplying x by a constant (such as changing dollars to cents or
pounds to kilograms), does not change the column space of X, so
will not change ŷ. Therefore, it must be that the coefficient of x
changes such that x β̂ stays the same.

ŷ = βb0 + βb1 x1 + βb2 x2


ŷ = c∗ + β
β c∗ x ∗ + β
c∗ x2
0 1 1 2

x1∗ = cx1 c∗ = βb0 , β


⇒ β 0
c∗ = βb1 /c, β
1
c∗ = βb2
2

12 / 34
13 / 34
I Change of units of the form x → a + cx: Any change of unit of
measurement that involves multiplying x by a constant and adding
or subtracting a constant (such as changing from ◦ C to ◦ F ) does
not change the column space of X either, so will not change ŷ.
Therefore, it must be that the β̂ changes in such a way that Xβ̂
stays the same.

ŷ = βb0 + βb1 x1 + βb2 x2


ŷ = c∗ + β
β c∗ x ∗ + β
c∗ x2
0 1 1 2

x1∗ = a + cx1 c∗ = βb0 − aβb1 /c, β


⇒ β 0
c∗ = βb1 /c, β
1
c∗ = βb2
2

I In both cases, since ŷ does not change, residuals, SST, SSE and
SSR all stay the same, so R 2 will not change.

14 / 34
I Changing the units of dependent variable y → cy : This changes the
length of y but the column space of X is still the same. So, ŷ will
be multiplied by c, and since both y and ŷ are multiplied by c, the
residuals will be multiplied by c also.

y = βb0 + βb1 x1 + βb2 x2 + û


∗ c∗ + β c∗ x1 + βc∗ x2 + uˆ∗
y = β0 1 2
∗ c∗ = c βb0 , β
c∗ = c βb1 , β
c∗ = c βb2 , uˆ∗ = c û
y = cy ⇒ β 0 1 2

I In this case, SST, SSE and SSR all change but all will be multiplied
by . . ., so again, R 2 will not change.

15 / 34
Estimators and their Unbiasedness

I An estimator is a formula that combines sample information and


produces an estimate for parameters of interest.
I Example: Sample average is an estimator for the population mean.
I Example: βb = (X0 X)−1 X0 y is an estimator for the parameter vector
β in the multiple regression model.
I Since estimators are functions of sample observations, they are ...
I While we do not know the values of population parameters, we can
use the power of mathematics to investigate if the expected value of
the estimator is indeed the parameter of interest
I Definition: An estimator is an unbiased estimator of a parameter of
interest if its expected value is the parameter of interest.

16 / 34
The Expected Value of the OLS Estimator

I Under the following assumptions, E (β)


b =β

Multiple Regression Model


MLR.1 Linear in Parameters E.1 Linear in Parameters
y = β0 + β1 x1 + · · · + βk xk + u y = X β + u
n×1 n×(k+1) (k+1)×1 n×1
MLR.2 Random Sampling
We have a random sample of n observations E.3 Zero Conditional Mean
MLR.4 Zero Conditional Mean E (u | X) = 0
(n×1)
E (u | x1 , x2 , . . . , xk ) = 0
MLR.3 No Perfect Collinearity E.2 No Perfect Collinearity
None of x1 , x2 , . . . , xk is a constant and there X has rank k + 1
are no exact linear relationships among them

17 / 34
I We use an important property of conditional expectations to prove
that the OLS estimator is unbiased: if E (z | w ) = 0, then
E (g (w )z) = 0 for any function g . For example
E (wz) = 0, E (w 2 z) = 0, etc.
I Now let’s show E (β)
b =β
I The assumption of no perfect collinearity (E.2) is immediately
required because ...
I Step 1: Using the population model (Assumption E.1), substitute
for y in the estimator’s formula and simplify

βb = (X0 X)−1 X0 y
= (X0 X)−1 X0 (Xβ + u)
= β + (X0 X)−1 X0 u

18 / 34
I Step 2: Take expectations:

E (β)
b = E (β + (X0 X)−1 X0 u)
= β + E ((X0 X)−1 X0 u)
= β

using Assumption E.3, since E (u | X) = 0 ⇒ E ((X0 X)−1 X0 u) = 0


I Note that all assumptions were needed and were used in this proof.

19 / 34
I Are these assumptions too strong?
I Linearity in parameters is not too strong, and does not exclude
non-linear relationships between y and x (more on this later)
I Random sample is not too strong for cross-sectional data if
participation in the sample is not voluntary. Randomness is
obviously not correct for time series data.
I Perfect multicollinearity is quite unlikely unless we have done
something silly like using income in dollars and income in cents in
the list of independent variables, or we have fallen into the “dummy
variable trap” (more on this later)
I Zero conditional mean is not a problem for predictive analytics,
because for best prediction, we always want our best estimate of
E (y | x1 , x2 , . . . , xk ) for a set of observed predictors.

20 / 34
I Zero conditional mean can be a problem for prescriptive analytics
(causal analysis) when we want to establish the causal effect of one
of the x variables, say x1 on y controlling for an attribute that we
cannot measure. E.g. Causal effect of education on wage keeping
ability constant:

We want to estimate β1 in: wage = β0 + β1 educ + β2 ability + u


We run the regression: wage = α̂0 + α̂1 educ + v̂
∂ability
E (α̂1 ) = β1 + β2 6= β1
∂educ
I α̂1 is a biased estimator of β1
I This is referred to as “omitted variable bias”
I One solution is to add a measurable proxy for ability, such as IQ
I Other solutions in ETC3410

21 / 34
I Zero conditional mean can be quite restrictive in time series analysis
as well.
I It implies that the error term in any time period t is uncorrelated
with each of the regressors, in all time periods, past, present and
future.
I Assumption MLR.4 is violated when the regression model contains
a lag of the dependent variable as a regressor. E.g. We want to
predict this quarter’s GDP using GDP outcomes for the past 4
quarters
I In this case, the regression parameters are biased estimators.

22 / 34
The Variance of the OLS Estimator

I Coming up with unbiased estimators is not too hard. But how the
estimates that produce are dispersed around the mean determines
how precise they are.
I To study the variance of β,
b we need to learn about variances of a
vector of random variables var-cov matrix

23 / 34
I We introduce an extra assumption:

MLR.1 Linear in Parameters E.1 Linear in Parameters


y = β0 + β1 x1 + · · · + βk xk + u y = X β + u
n×1 n×(k+1) (k+1)×1 n×1
MLR.2 Random Sampling
We have a random sample of n observations E.3 Zero Conditional Mean
MLR.4 Zero Conditional Mean E (u | X) = 0
(n×1)
E (u | x1 , x2 , . . . , xk ) = 0
MLR.3 No Perfect Collinearity E.2 No Perfect Collinearity
None of x1 , x2 , . . . , xk is a constant and there X has rank k + 1
are no exact linear relationships among them
MLR.5 Homoskedasticity E.4 Homo+Randomness
E (u 2 | x1 , x2 , . . . , xk ) = σ 2 Var (u | X) = σ 2 In

24 / 34
I Under these assumptions, the variance-covariance matrix of the OLS
estimator conditional on X is

Var (βb | X) = σ 2 (X0 X)−1

because

25 / 34
I We can immediately see that given X the OLS estimator will be
more precise (i.e. its variance will be smaller) the smaller σ 2 is
I It can also be seen (not as obvious) that as we add observations to
the sample, the variance of the estimator decreases, which also
makes sense
I Gauss-Markov Theorem: Under Assumptions E.1 to E.4 (or MLR.1
to MLR.5) βb is the best linear unbiased estimator (B.L.U.E.) of β.
I This means that there is no other estimator that can be written as a
linear combination of elements of y that will be unbiased and will
have a lower variance than β.
b
I This is the reason that everybody loves the OLS estimator.

26 / 34
Estimating the Error Variance

I We can calculate β,
b and we showed that

b = σ 2 (X0 X)−1
Var (β)

but we cannot compute this because we do not know σ 2


I An unbiased estimator of σ 2 is
Pn
2
2
i=1 ûi û0 û
σ̂ = =
n−k −1 n−k −1

I The square root of σ̂ 2 , σ̂, is reported by eviews in regression output


under the name standard error of the regression, and it is a
measure of how good the fit is (the smaller, the better)

27 / 34
I Why do we divide SSR by n − k − 1 instead of n?
I In order to get an unbiased estimator of σ 2

û0 û
E( ) = σ2 .
n−k −1
If we divide by n the expected value of the estimator will be slightly
different from the true parameter (proof of unbiasedness of σ̂ 2 is
not required)
I Of course if the sample size n is large, this bias is very small
I Think about dimensions: ŷ is in the column space of X, so it is in a
subspace with dimension k + 1
I û is orthogonal to column space of X, so it is in a subspace with
dimension n − (k + 1) = n − k − 1. So even though there are n
coordinates in û, only n − k − 1 of those are free (it has n − k − 1
degrees of freedom)

28 / 34
Standard error of each parameter estimate

I The standard error of each β̂j , j = 0, 1, . . . , k is the square root of


the diagonal elements of

Var\
(β̂ | X) = σ̂ 2 (X0 X)−1

I These are reported under the heading Std. Error in eviews


I When reporting regression results, we usually report these standard
errors under their corresponding parameter estimates

29 / 34
I Example: Predicting wage with education and experience

I We report these results as

[ = −3.39 + 0.64 education + 0.07 experience


wage
(0.77) (0.05) (0.01)

30 / 34
Summary

I Under the assumptions that:


1. the population model is linear in parameters;
2. the conditional expectation of true errors given all explanatory
variables is zero;
3. the sample is randomly selected; and
4. the explanatory variables are not perfectly collinear
the OLS estimator βb is an unbiased estimator of β.
I If we add no conditional heteroskedasticity to the above
assumptions, then OLS is the B.L.U.E. for β

31 / 34
Summary

I Interpretation of OLS estimates in multiple regression: Very


important to interpret the coefficients using the context, and very
important to remember that each βb estimates the partial effect of
its corresponding x all other x staying constant.
I Linear scaling: Linear transformation of dependent or independent
variables changes the outcomes of linear regression in exactly
predictable ways, and all of these outcomes are substantively the
same. Hence, we should use the units of measurement which make
presentation of results more understandable without worrying that
changing the units of data may affect our results.

32 / 34
Digression: The Variance-Covariance Matrix

I Consider n random variables v1 , v2 , . . . , vn with means


µ1 , µ2 , . . . , µn . We can write:
   
v1 µ1
 v2  µ2 
v =  .  , E (v) = µ =  . 
   
n×1  ..  n×1  .. 
vn µn

I Denote the variance of vi by σi2 for i = 1, . . . , n, and covariance of


vi and vj by σij for i = 1, . . . , n, j = 1, . . . , n and i 6= j

33 / 34
I We define the variance-covariance matrix of v as

Var (v) = E (v − µ)(v − µ)0


 2 
σ1 σ12 · · · σ1n
σ21 σ22 · · · σ2n 
=  .
 
.. .. 
 .. . . 
σn1 σn2 ··· σn2

I Note that since σij = Cov (vi , vj ) = Cov (vj , vi ) = σji , this matrix is
symmetric.
I An important property: If A is an n × k matrix of constants, then

Var (A0 v) = A0 Var (v)A

I Always check that dimensions are compatible when using matrices


variance

34 / 34