
Multiple Regression Analysis: Estimation

Sharper Sikota October 2022


Quantitative Methods in Economics School of Economics, Kwame Nkrumah
University

0 / 59
Econometrics: why do we do it?
•We love playing with data and are interested in relationships

•Hypothesis testing: we want to answer questions

•We read the literature and think we can do it better ☺

•Ultimately: IDENTIFICATION

1
Empirical Economics
• Using data and statistical methods to examine the evidence – a
supplement/complement to theory
• Distinguishing between correlations and causal relationships is the key
task in empirical economics
• Correlated: Two economic variables move together.
• Causal: Movement in one variable causes movement in the other.
• Variables A and B move together: either (1) by chance, or (2) A causes B, or
(3) B causes A, or (4) another variable C causes movement in A and B.
2
Ice Cream and Cancer, Nicolas Cage and Drowning

3
School enrollment rates are lower among Progresa beneficiaries.
Why?
What are the characteristics of grant beneficiaries?

• (1) kids with an existing education deficit, or (2) those needing to leave school to work, or (3) those
needing to care for someone in a large household. (4) They may live in households with adults
who don't value education highly – having never had it themselves, or having had poor-quality education – (5)
and who therefore also can't help their kids with their education. And finally, (6) Progresa
kids probably live in areas with low-quality schools
•The probability of enrolment is much lower even before Progresa
•Long story short: low income (variable C) causes grant receipt (variable A), low income (variable C)
causes poor enrolment (variable B), thus we observe a correlation between grant receipt (A) and
poor enrolment (B). A does not cause B!
•Maybe the grant was not enough to get them through school, given their unfortunate circumstances,
or it came too late.
•We expect the coefficient on grant receipt to be negative.
•How can you really identify the grant effect?

4
Thoughts
•Identifying causal effects is exceptionally difficult – we don’t expect it from
you in your project, but we do expect you to think and talk about it
•We can use a randomised controlled trial to establish causality, but it has its
own logistical and ethical downsides.
•What we can do right now is to control for the many important factors
that confound identification

This is where we begin in Chapter 3

5
Multiple Regression
Analysis: Estimation
Chapter 3 Roadmap
1. Motivation for Multiple Regression
2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR
7
8
Stata Output Interpretation
[Annotated Stata regression output, labelling n, SST, SSE, SSR, R², σ̂² and σ̂ (the root MSE), and the coefficient estimate β̂1]

se(β̂0) and se(β̂1) are just the square roots of the estimated variances
9
The Model with Two Independent Variables
•Consider the effect of education on monthly wage

𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢𝑐 + 𝑢

• What’s missing?
• There is a lot which might be left in u
• What variable might be related to both of them?

10
• Interpretation? An additional year of education is
associated with a R694 increase in the monthly wage, C.P.
11
The Model with Two Independent Variables
•Let’s add in age of the individual (in years)
𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢𝑐 + 𝛽2 𝑎𝑔𝑒 + 𝑢

•We take age out of the error term and explicitly put it in the equation
Note:
◦ Education tends to be correlated with age
◦ 𝛽2 measures the ceteris paribus effect of age on wage

12
• 1 more year of age is associated with a R148 increase in monthly wage
• What happened to the coefficient on education?

13
MLR Advantage: Flexibility in Functional Form

•Does wage change linearly with age? Not really.


•We can add in age² to model a more realistic relationship:
wage = β0 + β1·educ + β2·age + β3·age² + u

•NB! Be super careful when interpreting the coefficients.
By how much does wage increase if age is increased by one year? It depends on how old the person already is:
∂wage/∂age = β2 + 2·β3·age
14
What is the impact of a change in age on wages, for a 22-year-old or a 60-year-old?
∂wage/∂age = 107.3 + 2·0.51·age; at age 22: 107.3 + 2·0.51·22 = R129.7

15
Was this a Linear Regression Model?
•Yes
•It does assume a quadratic relationship between wage and age BUT
•The model must be linear in the coefficients (𝛽𝑗 ), not the variables (linear
regression definition)

Note: There is a detailed interpretation of quadratic models


in Chapter 6

16
Multiple Linear Regression: Terminology
Explain variable y in terms of variables x1, x2, …, xk:

y = β0 + β1x1 + β2x2 + … + βkxk + u

◦ β0: intercept
◦ β1, …, βk: partial effects, coefficients, slope parameters
◦ u: error term, disturbance, unobservables
◦ y: dependent variable, explained variable, response variable,
endogenous variable, outcome variable…
◦ x1, …, xk: independent variables, explanatory variables, regressors,
control or exogenous variables, covariates, determinants…
17
Exact Interpretation

• NB! Holding other factors fixed = ceteris paribus = all things being equal
• For those with the same levels of x2, what is the impact of
changing x1 by one unit, on y?
• The coefficients are also called partial/marginal effects
18
Conditional Expected Value
Often we forget what the conditional expectation is:
An unconditional expectation, E(X), is just a number that we calculate.
A conditional expectation, E(X|Y1, Y2, Y3), often confuses. It means: calculate the expected value of X when the
values of Y1, Y2, Y3 are given, or set at particular values. For example:

E(X|Y1, Y2, Y3) = E(X|Y1 = 5, Y2 = 1, Y3 = 18)
Calculate the expected value of X, but first set the values of Y1, Y2, Y3 as above.
What if the conditional mean equals the unconditional one, i.e. E(X|Y1, Y2, Y3) = E(X)?
This implies the expected value of X doesn't change, even if you set Y1, Y2, Y3 to different values.
Thus X is mean-independent of Y1, Y2, Y3, which implies Corr(X, Y1) = 0, Corr(X, Y2) = 0, Corr(X, Y3) = 0
19
Key Assumption: Zero Conditional Mean
E(u | x1, x2, …, xk) = 0
•This is the same assumption as previously – SLR.4: E(u|x) = 0
•All the factors in u, the unobserved term, are uncorrelated with the x's.
•The other way to say it: E(u | x1, x2, …, xk) = E(u) = 0
•This will not hold if any of the x's are correlated with anything in u
•Remember u is just a variable, whose mean we can calculate.
We will return to this concept.

20
Chapter 3 Roadmap

1. Motivation for Multiple Regression


2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR

21
Random Sampling
•A random sample means every member of the population had an equal chance
of being selected into the sample of size n
•Or, every household had an equal probability of being selected.
•The sample is then representative of the population
•Start with a random sample (n individuals, k independent variables)
•The data on individuals i = 1, …, n for the variables x1, …, xk and y are
represented as: {(xi1, xi2, …, xik, yi): i = 1, …, n}

22
How do we obtain the OLS estimates?
• Start with a random sample (n individuals, k independent variables)
{(xi1, xi2, …, xik, yi): i = 1, …, n}

• In the 2 variable case: ŷ = β̂0 + β̂1x1 + β̂2x2


•How do we obtain the estimates of 𝛽1 and 𝛽2 ?
• We want to explain as much as possible of y with our model, i.e., minimise
the sum of squared residuals – using ordinary least squares
• Why do we minimise the squared residuals and not just the residuals?

23
24
Obtaining the OLS Estimates: β̂k

25
Obtaining the OLS Estimates
•So how do we minimize the sum of squared residuals?

•We have the tools – we need to differentiate it with respect to
each of its variables, and set each of those derivatives to zero

•The SSR has k+1 data elements in it: y, and x1, …, xk

and k+1 variables: β̂0, β̂1, …, β̂k

26
How do we differentiate the SSR?

27
Given the SSR:

•NB: We differentiate w.r.t. the β̂0, β̂1, …, β̂k, NOT the x's.

•We have data for the x's already;

we must calculate β̂0, β̂1, …, β̂k

•We need ∂SSR/∂β̂i, for i = 0, 1, 2, …, k

•Set the derivatives equal to zero,
to minimize SSR
28
We obtain:

CLEAN UP:
We divide through by -2 to get
rid of the 2s and the minus
signs.

29
Finally:

- Also solve using the method of moments: it uses E(u) = 0 and E(xju) = 0, j = 1, 2, …, k
- Clearly we don’t do this by hand.
- We would solve this using linear algebra. How are we guaranteed unique solutions?
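A sketch of how this is done in practice, assuming illustrative simulated data: stack the regressors into a design matrix X (with a column of ones for the intercept) and solve the normal equations X'Xβ̂ = X'y. A unique solution exists only when X'X is invertible, which is what MLR.3 (no perfect collinearity) guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)          # correlated with x1, but not perfectly
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept column

# Normal equations: (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)            # approximately [2.0, 1.5, -0.8]

# First-order conditions: residuals sum to zero and are orthogonal to each regressor
u_hat = y - X @ beta_hat
print(u_hat.mean())        # ~ 0
print(X.T @ u_hat)         # each entry ~ 0
```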

30
Interpretation of the OLS Regression Equation

By how much does y change if the j-th independent variable is


increased by one unit, holding all other independent variables and
the error term constant

◦ MLR holds the other x variables fixed even if, in reality, they are
correlated with the x variable under consideration - "ceteris paribus"
◦ NB! We still assume that unobserved factors in u do not change if the
explanatory variables change: E(u|xi) = E(u) = 0

31
Determinants of Wages: Interpretations
ŵage = −9557 + 865·educ + 147·age

Holding age fixed, 1 more year of education is associated with another


R865 in monthly wages
Or: Comparing two workers with the same age, but the education level of
worker A is one year higher, we predict worker A to have a wage that is
R865 higher than that of worker B
Holding education fixed, 1 more year of age is associated with an
additional R147 in monthly wages

32
Properties of OLS in any Sample
• Fitted values and residuals

Fitted or predicted values: ŷi = β̂0 + β̂1xi1 + ⋯ + β̂kxik

Residuals: ûi = yi − ŷi
• Algebraic properties of OLS regression

◦ Deviations from the regression line sum to zero
◦ Correlations between deviations and regressors are zero
◦ Sample averages of y and of the regressors lie on the regression line
33
Correlation and Covariance: Corr(X, Y) = Cov(X, Y) / (σX·σY)
Note:
E(ûi) = (1/n)·Σᵢ ûi = 0, thus Σᵢ ûi = 0
In addition, why does Σᵢ xij·ûi = 0 tell us anything about Corr(xij, ûi)?
We know:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)·E(Y)
Cov(xij, ûi) = E(xij·ûi) − E(xij)·E(ûi) = E(xij·ûi)
E(xij·ûi) = (1/n)·Σᵢ xij·ûi = 0
Implying Corr(xij, ûi) = 0, since σX, σY > 0: they don't change the sign of Cov(X, Y)

34
OLS Facts
(1) We know the sample average of the residuals is zero, and thus:
Given ûi = yi − ŷi, we average both sides; the LHS average is 0, so ȳ equals the sample average of the fitted values ŷi
(2) The sample covariance between each xj and û is zero, and thus:
Cov(ŷi, ûi) = 0
ŷi is a function of all the x's, and hence is also uncorrelated with ûi.
(3) The point (x̄1, x̄2, …, x̄k, ȳ) always lies on the regression line (page 74):
ȳ = β̂0 + β̂1x̄1 + β̂2x̄2 + ⋯ + β̂kx̄k
Residuals: ûi = yi − ŷi

•What if ûi > 0?

•That means our actual yi > our predicted ŷi, i.e. the model
underpredicted the person's y value.

•Similarly, ûi < 0 implies yi < ŷi, i.e. an overprediction


•E.g. in a wage equation, should you quit your job if your residual is
negative?

36
Simple and Multiple Regression Compared
• Simple regression of y on x1:
ỹ = β̃0 + β̃1x1

• Multiple regression of y on x1 and x2:

ŷ = β̂0 + β̂1x1 + β̂2x2

• We can write β̃1 = β̂1 + β̂2·δ̃1 (Note: we don't prove this)
where δ̃1 is the slope coefficient from the regression of x2 on x1
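A quick numerical check of this identity on simulated data (a sketch; the relationship holds exactly in any sample, and the variable names and numbers below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(xs, y):
    """OLS with an intercept; xs is a list of regressor arrays."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_tilde = ols([x1], y)           # simple regression of y on x1
b_hat = ols([x1, x2], y)         # multiple regression of y on x1, x2
delta = ols([x1], x2)            # regression of x2 on x1

# beta_tilde_1 = beta_hat_1 + beta_hat_2 * delta_tilde_1, exactly
print(b_tilde[1], b_hat[1] + b_hat[2] * delta[1])
```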

37
Given β̃1 = β̂1 + β̂2·δ̃1, when is β̃1 = β̂1?

1. β̂2 = 0
◦ This is when the partial effect of x2 on y is zero
2. δ̃1 = 0
◦ This is when x1 and x2 are uncorrelated
• We can use the formula to compare β̃1 and β̂1.

• We regress wage on age: what is the expected impact of adding education?


38
Goodness-of-Fit
Same Definitions as Before:

NB! You should know how to use the relationship SST = SSE + SSR to derive
the R-squared.

39
Goodness-of-Fit
•Decomposition of total variation:
SST = SSE + SSR

•R-squared:
R² = SSE/SST = 1 − SSR/SST
NB: The R² can never decrease if another explanatory
variable is added to the regression

•Alternative expression for R-squared:
R² is equal to the squared correlation coefficient between the actual and the
predicted values of the y variable

Why is this a useful expression?
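A small sketch illustrating both claims on simulated data (illustrative names and numbers): R² equals the squared sample correlation between y and ŷ, and it cannot fall when an extra, even irrelevant, regressor is added.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)            # an irrelevant "regressor"
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def fit_r2(y, *xs):
    X = np.column_stack([np.ones(len(y))] + list(xs))
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ b
    ssr = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ssr / sst
    corr2 = np.corrcoef(y, y_hat)[0, 1] ** 2   # squared corr(y, y_hat)
    return r2, corr2

print(fit_r2(y, x1))          # R^2 equals the squared correlation
print(fit_r2(y, x1, noise))   # R^2 is (weakly) larger with the extra regressor
```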


40
Example: Determinants of Wages

41
R-squared never
decreases

Should we keep pasta_spend in the regression?


42
Should we keep father_educ in the regression?
43
Adding Paternal Education
Interpretation:
◦ Higher father's education increases wage
◦ There is limited additional explanatory power, as the R² barely increases
•NB:
◦ Even if the R-squared is small, the regression may still provide good
estimates of the causal relationships!
• Caveat – with regression through the origin (no β̂0), R² can be negative
•With macro data, R² can be very high, due to underlying time trends
44
Chapter 3 Roadmap
1. Motivation for Multiple Regression
2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR

45
Standard Assumptions for Multiple Regression

•Assumption MLR.1 (Linear in parameters)
The population model is linear in the parameters (the
coefficients, NOT the variables)

•Assumption MLR.2 (Random sampling)
The data is a random sample drawn from the population

Each data point therefore follows the population equation

46
Standard Assumptions: MLR.3
•Assumption MLR.3 (no perfect collinearity)
"In the sample (and therefore in the population), none of the independent
variables is constant and there are no exact linear relationships
among the independent variables“

◦ MLR.3 only rules out perfect collinearity/correlation between explanatory


variables; imperfect correlation is allowed
◦ If an explanatory variable is a perfect linear combination of other
explanatory variables it is superfluous and we can throw it out
◦ Constant variables are also ruled out

47
Violations of Assumption MLR.3
• One variable is a constant multiple of the other
◦ E.g. Regress wage on education in years, and education in decades
Educ_decades = educyrs/10
• A variable can be expressed as an exact linear function of others
◦ E.g. Wage on number of children in hh, number of adults in hh, and hh size
HHsize = numchildrenHH + numadultHH (don’t include all 3!)
• Too small a sample size – MLR.3 fails if n < k + 1 – or extreme bad luck
• Question: Could we include log(inc) and log(inc²) in a wage equation?
48
MLR.4: Zero Conditional Mean
E(ui | x1i, x2i, …, xki) = E(u) = 0
The values of the explanatory variables should not contain any information
about the mean of the unobserved factors

◦ In MLR, the zero conditional mean assumption is more likely to hold


because fewer things end up in the error
•Example: Wage
𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢𝑐 + 𝛽2 𝑎𝑔𝑒 + 𝑢

If age was not included in the regression, it would end up in the error
term; it would be hard to defend that educ is uncorrelated with u
49
Zero Conditional Mean Misconceptions
Which is a better way of putting MLR.4?
E(ui | xi1, xi2, …, xik) = 0
Or
E(ui | xi1, xi2, …, xik) = E(u) = 0?

The 2nd line: the part that E(u|x) = E(u) is the most important!
• That E(u) = 0 is immaterial (we could have made it any constant value)
• It is most important to know that u and x must be uncorrelated
50
When does MLR.4 fail?
E(ui | xi1, xi2, …, xik) = 0
This could fail with a mis-specified functional form:
• E.g. leaving out age² in a wage equation, or using wage instead of log(wage).
• It also fails if we omit an important determinant of y which is correlated
with any of the xs.
→ This might happen if we are lacking data, or don‘t know what to include
•If MLR.4 is violated, the OLS estimators are biased

51
Endogenous vs Exogenous Variables
◦ Endogenous variables
◦ Explanatory variables that are correlated with the error term
◦ Endogeneity is a violation of assumption MLR.4
◦ Exogenous variables
◦ Explanatory variables that are uncorrelated with the error term
◦ MLR.4 holds if all explanatory variables are exogenous
◦ Exogeneity
◦ key assumption for a causal interpretation of the regression
◦ and for unbiasedness of the OLS estimators

52
When is an x variable endogenous?
• If we have omitted variables
→ E.g. no data on IQ in a wage equation
• If the x suffers from measurement error
→ E.g. household income
• If explanatory variables are determined jointly with y
→E.g. price and quantity

53
NB! How do MLR.3 and MLR.4 differ?
•MLR.3 – no perfect collinearity
◦ We can tell immediately if MLR.3 holds or doesn’t
◦ Stata just won’t run the regression – it is smarter than us

•MLR.4 – zero conditional mean


◦ Relationship between unobserved factors and the x’s
◦ We can never be sure
◦ Critical assumption!
54
55
Unbiasedness of OLS (Theorem 3.1)

• Unbiasedness is an average property in repeated samples


• Thought experiment!
• In a specific sample, β̂ may be far away from β
• The procedure by which OLS estimates are obtained is unbiased
• Without context, we have no reason to believe our estimate is
more likely to be too big/small

56
Does this regression satisfy MLR.1 – MLR.4?

57
Including Irrelevant Variables in a Regression

Suppose x3 is irrelevant: β3 = 0 in the population

• Doesn't affect unbiasedness of β̂1 and β̂2, because under MLR.1-MLR.4:

E(β̂1) = β1, E(β̂2) = β2 and E(β̂3) = β3 = 0

However, including irrelevant variables may increase the
sampling variance.
58
Omitted Variable Bias (OVB): The Simple Case

True model (contains x1 and x2):
y = β0 + β1x1 + β2x2 + u

Estimated model (x2 is omitted):
ỹ = β̃0 + β̃1x1

•We use "~" rather than "^" for the underspecified model
59
60
What is β̃1 if we leave out x2?
If x1 and x2 are correlated, assume a linear
regression relationship between them

[Annotations on the omitted-variable algebra: if y is only regressed on x1, the labelled terms are the estimated intercept, the estimated slope on x1, and the error term]

•NB! ALL estimated coefficients will be biased

•β0 is an innocent bystander
61
Eg: OVB in a Wage Equation

Omitting ability: β2 and δ̃1 will both be positive

The return to education, β1, will be overestimated because β2·δ̃1 > 0.

It will look as if people with many years of education earn very high wages,

BUT this is partly because people with more education are also more able on
average.
62
When is there no OVB? Is β̃1 biased?
• There is no OVB if the omitted variable is irrelevant, or uncorrelated with x1. If not:
β̃1 = β̂1 + β̂2·δ̃1

We must find E(β̃1):
E(β̃1) = E(β̂1 + β̂2·δ̃1) = E(β̂1) + E(β̂2·δ̃1) = E(β̂1) + E(β̂2)·δ̃1 = β1 + β2·δ̃1

The bias is the difference between the two: Bias(β̃1) = E(β̃1) − β1 = β2·δ̃1

BIAS = ZERO if either of its component parts is zero.
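A hedged simulation of the bias formula (illustrative parameter values, not from the slides): draw many samples from a model where x2 matters and is correlated with x1, omit x2, and compare the average short-regression slope with β1 + β2·δ1.

```python
import numpy as np

rng = np.random.default_rng(3)
beta1, beta2, delta1 = 2.0, 3.0, 0.5    # true slopes; x2 = delta1*x1 + v
n, reps = 200, 2000
slopes = []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = delta1 * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1])      # x2 omitted from the regression
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    slopes.append(b[1])

print(np.mean(slopes))   # ~ beta1 + beta2*delta1 = 3.5, not the true 2.0
```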


63
Summary of OVB: Bias(β̃1) = β2·δ̃1

                 δ̃1 > 0, Corr(x1, x2) > 0    δ̃1 < 0, Corr(x1, x2) < 0
β2 > 0           Positive bias                Negative bias
β2 < 0           Negative bias                Positive bias

64
OVB Terminology: Bias(β̃1) = E(β̃1) − β1 = β2·δ̃1

NB: true model: y = β0 + β1x1 + β2x2 + u; estimated model: ỹ = β̃0 + β̃1x1
β1 is what you want, E(β̃1) is what you get

• Upward bias
E(β̃1) > β1

• Downward bias
E(β̃1) < β1

• Biased towards zero (from either side)
E(β̃1) is closer to zero than is β1

65
What happened here? Is there OVB?

66
δ̃1 < 0 (?) Why might this be the case?

67
OVB: More General Cases

True model (contains x1, x2 and x3)

Estimated model (x3 is omitted)

◦ No general statements possible about the direction of bias

◦ Analysis as in the simple case IF we assume the regressor of interest
is uncorrelated with the other regressors
68
E.g: Omitting Ability in the Wage Equation

•We add exper to the wage equation used previously, with omitted variable abil
•What happens to the coefficient on exper if we omit abil?
•The coefficients on both educ and exper will be biased, even if Corr(abil,exper) = 0
•Will the coefficient on educ be biased?
•We can treat this like the simple 2-variable case if (1) Corr(abil, exper) = 0 and (2) Corr(educ, exper) = 0,
i.e. abil is correlated only with educ and not exper, and educ and exper are not correlated.
•This is quite unlikely
•We usually just ignore the other variables and focus on our variable of interest (e.g. x1), and
discuss whether its coefficient is biased – but this is only ok if x2 to xk are uncorrelated with x1.

69
Discussion

•Recall our multiple wage regression:


𝑤𝑎𝑔𝑒 = 𝛽0 + 𝛽1 𝑒𝑑𝑢𝑐 + 𝛽2 𝑎𝑔𝑒 + 𝑢

•What are some omitted variables in this regression?

70
Chapter 3 Roadmap
1. Motivation for Multiple Regression
2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR

71
Standard MLR Assumptions: Continued

• Assumption MLR.5 (Homoscedasticity)
Var(u | x1, …, xk) = σ²
The explanatory variables x must not be related to
the variance of the unobserved factors, u

• Example: Wage equation

This assumption may be hard
to justify in many cases NB

• Shorthand notation: Var(u | x) = σ²

where all explanatory variables are
collected in a random vector x

72
Wait. Why is Var(u|x) = Var(y|x) = σ²?
Given: y = β0 + β1x1 + β2x2 + ... + βkxk + u
Let A = β0 + β1x1 + β2x2 + ... + βkxk.
Then y = A + u, and Var(y|x) = Var(A + u|x)
Then Var(y|x) = Var(A|x) + Var(u|x), because Cov(A, u|x) = 0 (if E(u|x) = E(u) = 0)
Var(A|x) = 0 because once the values of x are given, A does not vary.
Therefore Var(y|x) = 0 + Var(u|x) = 0 + σ² = σ²

73
The Canonical Heteroskedasticity Example

74
Gauss-Markov Assumptions
• MLR.1 through MLR.5 are the Gauss Markov assumptions for cross
sectional regression
• Assumptions MLR.1 and MLR.4 summarized:

E(y | x) = β0 + β1x1 + β2x2 + ⋯ + βkxk

◦ where x = set of all independent variables (x1, …, xk)

• Assumption MLR.5 is the same as:
Var(y | x) = σ²

75
Theorem 3.2:
Sampling Variances of the OLS Slope Estimators

Under Assumptions MLR.1 – MLR.5:

Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

where σ² is the variance of the error term, SSTj is the total sample variation in xj,
and Rj² is the R-squared from a regression of xj on all other independent variables

NB! this is tattoo worthy, t-shirt worthy, write-on-the-wall worthy
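A sketch verifying this formula on simulated data (illustrative names and values): compute SSTj and Rj² directly, plug in an estimate of σ², and compare with the conventional OLS variance estimate σ̂²·(X'X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)        # x1 and x2 are correlated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b
k = 2
sigma2_hat = (u_hat @ u_hat) / (n - k - 1)

# Conventional estimate: diagonal of sigma2_hat * (X'X)^-1, entry for x1
var_conventional = sigma2_hat * np.linalg.inv(X.T @ X)[1, 1]

# Formula route for beta_1: SST_1 and R_1^2 from regressing x1 on the other regressors
sst1 = np.sum((x1 - x1.mean()) ** 2)
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.lstsq(Z, x1, rcond=None)[0]
resid = x1 - Z @ g
r2_1 = 1 - np.sum(resid ** 2) / sst1
var_formula = sigma2_hat / (sst1 * (1 - r2_1))

print(var_conventional, var_formula)      # the two routes give the same number
```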
76
The Components of Variance: Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

•NB! We want Var(β̂j) to be small. WHY?
•The error variance (σ²)
◦ A high σ² increases the sampling variance due to more "noise" in the equation
◦ A large error variance makes estimates imprecise
◦ σ² does not decrease with sample size. WHY? It is a population-level parameter
• The total sample variation in the explanatory variable (SSTj)
◦ More sample variation leads to more precise estimates
◦ Total sample variation automatically increases with the sample size
◦ Increasing the sample size is a way to get more precise estimates

77
OLS Variance Components: Multicollinearity
Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

Investigating linear relationships among the x's (Rj²):
Regress xj on ALL other independent variables (including a constant)

The R-squared of this regression will be higher, the better xj can be
linearly explained by the other x's

◦ Var(β̂j) will be higher, the better xj can be linearly explained by the other x's
(Why? Look to see where Rj² is in the variance formula)
◦ The problem of almost linearly dependent explanatory variables is
called multicollinearity (i.e. Rj² → 1 for some xj)
78
Multicollinearity: Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

• Rj² = 0. An example: a perfectly randomly allocated xj

◦ Smallest Var(β̂j)
◦ xj has zero sample correlation with every other x.

•Rj² = 1
◦ Ruled out by MLR.3
◦ xj is a perfect linear combination of some of the other x's

•Rj² close to 1
◦ High correlation between 2 or more x variables - but not perfect!
◦ Multicollinearity, but NOT a violation of MLR.3
◦ Multicollinearity, but NOT a violation of MLR.3
79
Example: Test Marks
testmark = β0 + β1·num_lectures + β2·matric_mark + β3·GPA + u

•Assume matric_mark and GPA are highly correlated.

•Can we say anything about the relationship between lecture
attendance (num_lectures) and testmark?

Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

80
Multicollinearity Can be Irrelevant
• Consider this model:

y = β0 + β1x1 + β2x2 + β3x3 + u

◦ Where x2 and x3 are highly correlated

Var(β̂j) = σ² / [SSTj·(1 − Rj²)]

•Then Var(β̂2) and Var(β̂3) may be large
•But correlation between x2 and x3 has no direct effect on Var(β̂1)
•If x1 is uncorrelated with x2 and x3, R1² = 0

81
Multicollinearity Discussion
◦ Dropping some independent variables may reduce multicollinearity (but
might lead to omitted variable bias)
◦ NB: Only the sampling variance of the variables involved in
multicollinearity will be inflated; the estimates of other effects may be
very precise – thank goodness!
◦ NB: that multicollinearity is NOT a violation of MLR.3 in the strict sense
◦ Multicollinearity may be detected through variance inflation factors
(limited usefulness): VIFj = 1 / (1 − Rj²)
Arbitrary rule of thumb often used: the VIF should not be larger than 10
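A minimal sketch of the VIF calculation on illustrative data: regress each xj on the other regressors and form VIFj = 1/(1 − Rj²).

```python
import numpy as np

def vif(X):
    """VIF for each column of X (X should NOT include the intercept column)."""
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        g = np.linalg.lstsq(Z, xj, rcond=None)[0]
        resid = xj - Z @ g
        r2_j = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out.append(1 / (1 - r2_j))
    return out

rng = np.random.default_rng(5)
x1 = rng.normal(size=400)
x2 = x1 + 0.1 * rng.normal(size=400)       # nearly collinear with x1
x3 = rng.normal(size=400)                  # unrelated to the others
print(vif(np.column_stack([x1, x2, x3])))  # large VIFs for x1, x2; ~1 for x3
```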

82
Variances in Misspecified Models
◦ We decide to include a particular variable in a regression by analyzing
the tradeoff between bias and variance

True population model: y = β0 + β1x1 + β2x2 + u

Estimated Model 1: ŷ = β̂0 + β̂1x1 + β̂2x2

Estimated Model 2: ỹ = β̃0 + β̃1x1

◦ Mis-specified Model 2: smaller variance, but probably OVB. Which do


we care more about?
83
Variances in Misspecified Models

Compare Var(β̂1) from Model 1 with Var(β̃1) from Model 2:

•What if x2 is not relevant?

•What if x2 is relevant?

•Conclusion: it's weird, but from a variance perspective Model 2 is always preferred

84
Variances in Misspecified Models
• Things to note:

• Variance of underspecified model is always smaller


• A tradeoff exists between bias and variance
• Bias doesn‘t vanish in larger samples
• Conclusion: do not include irrelevant regressors (they bring high
variance)
• Small point – what about the change in the error, and thus σ², if we
take x2 out?
Conclusion: Do not include irrelevant regressors
85
Degrees of Freedom (dof): n-k-1
• The number of degrees of freedom is the number of values in the final
calculation of a statistic that are free to vary
• The number of independent pieces of information that go into the estimate
of a parameter
• dof of an estimate of a parameter are equal to the number of independent
scores that go into the estimate, minus the number of parameters used as
intermediate steps in the estimation of the parameter itself
• the number of dof is the number of independent observations in a sample
available to estimate a parameter of the population from which that
sample is drawn
86
Estimating the Error Variance
NB: If we don't divide by n-k-1, our estimator is biased. We don't prove this.

•OLS residuals:
ûi = yi − β̂0 − β̂1xi1 − β̂2xi2 − ⋯ − β̂kxik

•Estimator of the error variance: σ̂² = SSR / (n − k − 1) = (Σᵢ ûi²) / (n − k − 1)

•Degrees of freedom (df):


◦ df = n – (k+1)
◦ n = number of observations
◦ k+1 = number of estimated parameters (including intercept)
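A hedged simulation of why the n-k-1 divisor matters (illustrative parameter values): over repeated samples, SSR/(n-k-1) averages to the true σ², while SSR/n is systematically too small.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, reps, sigma2 = 30, 2, 5000, 4.0
unbiased, biased = [], []

for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=np.sqrt(sigma2), size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ b) ** 2)
    unbiased.append(ssr / (n - k - 1))
    biased.append(ssr / n)

print(np.mean(unbiased))   # ~ 4.0, the true error variance
print(np.mean(biased))     # ~ 4.0 * (n-k-1)/n, noticeably below 4.0
```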
87
Theorem 3.3: Unbiased Estimation of σ²
Under MLR.1 – MLR.5, E(σ̂²) = σ²

•Standard error of the regression (SER) = σ̂

◦ Also called standard error of the estimate or root mean squared error
◦ Estimator of the standard deviation of the error term
•If we add another x variable, σ̂ can increase or decrease
◦ SSR must fall
◦ But degrees of freedom also falls by one

88
Estimating the OLS Sampling Variances
Var(β̂j) = σ² / [SSTj·(1 − Rj²)] is the true sampling variation of the estimated β̂j

Plug in σ̂² for the unknown σ²:

Vâr(β̂j) = σ̂² / [SSTj·(1 − Rj²)] is the estimated sampling variation of the estimated β̂j
•Note that these formulae are only valid under MLR.1-MLR.5 (in
particular, there has to be homoscedasticity)

89
[Annotated Stata output: σ̂ (the root MSE) and the coefficient standard errors]
90
Heteroskedasticity: Var(u | x) = Var(y | x) ≠ σ²

• Violation of assumption MLR.5

• Does not cause bias in the β̂j
• Does lead to bias in the usual formula for Var(β̂j)
◦ Invalid standard errors

• Methods available for dealing with this


◦ Chapter 8 (more advanced econometrics courses)

91
Chapter 3 Roadmap

1. Motivation for Multiple Regression


2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR
92
Efficiency of OLS
◦ Under assumptions MLR.1 - MLR.5, OLS is unbiased
◦ However, under these assumptions there may be other unbiased
estimators
◦ Which of these unbiased estimators has the smallest variance?
◦ We limit ourselves to linear estimators, i.e. estimators that are linear in the
dependent variable: β̃j = Σᵢ wij·yi
where the weights wij may be arbitrary functions of the sample values
of all the explanatory variables; the OLS estimator
can be shown to be of this form (we don't prove it)

93
Theorem 3.4: Gauss-Markov Theorem
◦ Under assumptions MLR.1 - MLR.5, the OLS estimators are the best
linear unbiased estimators (BLUEs) of the regression coefficients, i.e.

Var(β̂j) ≤ Var(β̃j), for j = 0, 1, …, k, for all estimators β̃j that are linear and unbiased.

•OLS is only the best estimator if MLR.1 – MLR.5 hold; if there is


heteroscedasticity for example, there are better estimators.
94
OLS is BLUE
•Best
◦ Has the smallest variance
•Linear
◦ Can be expressed as a linear function of the data on the dependent variable
•Unbiased
◦ E(β̂j) = βj: on average, over repeated samples, the estimator equals the true parameter

•Estimator
◦ Rule that can be applied to any sample of data to produce an estimate

95
Chapter 3 Roadmap
1. Motivation for Multiple Regression
2. Mechanics and Interpretation of Ordinary Least Squares
3. The Expected Value of the OLS Estimators
4. The Variance of the OLS Estimators
5. Efficiency of OLS: The Gauss-Markov Theorem
6. The Language of MLR
7. Scenarios for MLR
96
The Language of Multiple Linear Regression
•NB: OLS is an estimation method, not a model (like the linear model below)

•This is a model: it describes the population, and its underlying parameters


•We can interpret β without having estimated this equation
•IF the MLR assumptions hold, we can talk about the OLS estimators
•NB: There are other methods of estimating the coefficients

•Do say: I estimated the model using ordinary least squares


•Don’t say: I estimated an OLS model

97
Interpreting logs in regressions

Note also that when we have a level-log or a log-level model, our β1 has a semi-elasticity
interpretation.
When we have a log-log model, β1 has an elasticity interpretation.
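For reference, the standard textbook interpretations of β1 in each functional form (a summary added here; Δ denotes a change, % a percentage change):
◦ level-level (y on x): Δy = β1·Δx
◦ level-log (y on log x): Δy ≈ (β1/100)·%Δx
◦ log-level (log y on x): %Δy ≈ (100·β1)·Δx (semi-elasticity)
◦ log-log (log y on log x): %Δy ≈ β1·%Δx (elasticity)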

98
