Professional Documents
Culture Documents
(Assumptions)
Lecture 18
Remember Regression?
son_hat = 33.887 + 0.514*father OLS intercept 33.887: No
n = 1078, R2 = 0.251, s_e = 2.437
80 interpretation b/c father cannot
Height son, inches
75
be 0 inches tall
70 OLS slope 0.514: For every extra
65 1 inch of father’s height, son is
60
on average about ½ inch taller
60 𝑦 (y-hat): Predicted y, given x;
65 70 75
Height father, inches
E.g. son of a 72 inch tall father
𝑠 (s.d. of residuals) 2.437 inches: predicted to be 70.895 inches
measures scatter about OLS line (= 33.887 + 0.514*72)
𝑅 0.251: 25.1% of variation in 𝑒 (residual): 𝑒 = 𝑦 − 𝑦 ; E.g. if
sons’ heights explained by 𝑦 is 70.895 but 𝑦 is 68.531,
variation in their fathers’ heights then residual is -2.364 inches
2
ECO220Y: Lecture 18
1
Questions and Data: Still Important
• Which kind of question? • Which kind of data?
– Research question: What – Observational or
is causal effect of a experimental data
change in X (e.g. match) • Correlation ≠ causation is
on Y (e.g. amount given) a cliché
– Descriptive question: • Instead, apply
understanding of data
what are patterns in data and specific context to
(e.g. how does interpret quantitative
household spending on results
food vary with income?) – Cross-sectional, time
series, or panel data
DEFR
KW IT AU BN BS IL AD
ES PF
NC NZ BH PT CY MO
log GDP per capita
GR
KR
MT BB SI
SA OM KN
AR UY
TT LY CZ MX
HRVE SK CL LB
BWZA
TR
MY
JM MUGA
HU PA
BZLT BR
PL EE CR
LV
“Unsurprisingly, we find
8
CUDO CO
GQ FJ
GT
TNNA
THSR
DZ MK
SZEGMEIRBA PY
PE SV
RUJO
RO BG
TO higher university density
MA AL PS EC BY KZ WS
SY IQ
CG HNCV BO NI PH is associated with higher
CN LK GY
BT RS ID
CICM
GM PG TM AZUA GE AM
YE
ZW
INPK
KE
MR SN
VN SD HT
UZ
MN GDP per capita levels.”
6
BD
GN LS
NG TL
BJ MD
AO TZ LA
ZM KH
CFUG TG
MGGH KG
BF MZ RW
NP
ML
MWNE TD
SL GWLR
Why associated (not correlated)?
ET BI TJ
CD Does quote imply causality?
https://www.state.gov/s/inr/rls/4250.htm
4
0 1 2 3 4 5
log (1 + universities per million people)
Figure is from appendix of Valero and Van Reenen (2019) and includes: “Notes: Each
6
observation is a country in 2000. Source: WHED and World Bank GDP per capita”
ECO220Y: Lecture 18
2
X-variable is defined as
Log(1 + universities per million people)
• Logs can straighten curved scatter plot
– Plus one addresses countries with 0 universities
– Example 1: x-value of 1 is a country with 1.72
universities per million: ln(1 + 1.72) 1
• E.g. 10 universities w/ pop. 5.82 million: 1.72 10/5.82
– Example 2: x-value of 3 is a country with 19.09
universities per million: ln(1 + 19.09) 3
• E.g. 25 universities w/ pop. 1.31 million: 19.09 25/1.31
– University density is over 11 times bigger in Example
2, but x-value only 3 times as big (diminishing returns)
But is it a strong
correlation?
Notes: Each observation is a country. Average annual growth rates over the period
8
1960-2000 on the y axis. Source: WHED and World Bank GDP per capita
KR RUSE
SKCAJP IE
LK NL
UA AM
BE KZ
10
FJ
CU ALDE
RO
DK CY
TJ ES CH LU
FRLT
BG PL IS
HK MT MD TO LV
BW RSMY JM
ITTT AR GBGR BB
BZ AT CL
KG PA BH
HR PEUY AE
BO FI PH
SGGY BN JO CR
ZA MN MX
PT
CN ZW SA SZ DZ IR GAEC
KE GH QA LY PY MU COSV
DO MO
ZM KW TR HN BR
CG EG VETH
KHTNNA
LS CM ID NI
VNIQ
TG
5
SY TZ
BD HT
IN UG LAGT SN
PK CI MAOn average, countries with 10%
MW
CF MM MR PG CD
LR
NP SL
AF GM SD
RWBJ higher university density
YE BI
NE
(univ/million people) have 0.264
ML MZ
higher years of average schooling.
0
0 1 2 3
log (1 + universities per million people)
Notes: Each observation is a country. Source: WHED and years of schooling obtained
9
from Barro-Lee dataset
ECO220Y: Lecture 18
3
Frozen Pizza (p. 627)
• How does the volume of sales depend on the
price of frozen pizza?
– What is the economic name of this relationship?
• Weekly data on price and quantity for each of
four cities (1994 – 1996); 156 weeks
– Raw data: ch18_MCSP_Frozen_Pizza.csv
– Cross-sectional, time series, or panel?
– Are these data observational or experimental?
10
Supply Shifters
D
1
Price 2 Quantity 2 Q
P
(P) Demanded (QD) S
3 4 3
Demand Shifters
D’
Demand shifters ARE lurking/omitted/ D
unobserved/confounding variables 4 Q
Even if price unchanged,
Demand shifters for frozen pizza? D
shift directly affects Q 11
n = 156 weeks
10
• 𝑟 = −0.7697 8
• 𝑅2 = 0.5924 6
4
• Is the OLS line an 2
estimate of the demand 2 2.2 2.4 2.6 2.8 3
Price, $1s
equation?
12
ECO220Y: Lecture 18
4
Simple Linear Regression:
One x-variable
• Model: 𝑦 = 𝛽 + 𝛽 𝑥 + 𝜀
– 𝑦 : dependent var., regressand, y-var., LHS-var.
– 𝑥 : independent var., regressor, explanatory var., x-
var., RHS-var. (i.e. right-hand side variable)
– 𝑖: observation index (often 𝑖 or 𝑗 cross-sectional
data; 𝑡 time series data; 𝑖𝑡 or 𝑗𝑡 panel data)
– 𝛽 : intercept (constant) parameter
– 𝛽 : slope parameter
– 𝜀 : error term, residual, disturbance
13
Error term in 𝑦 = 𝛽 + 𝛽 𝑥 + 𝜀
𝑦
𝐸[𝑦] = 𝛽 + 𝛽 𝑥 • 𝜀 includes all other
𝐸[𝑦 |𝑥 ] factors that affect 𝑦
𝜀 aside from 𝑥
𝑦
– Impossible to collect
data on everything:
some variables
𝑥 𝑥 unobserved to the
– Line is expected value: researcher
E[𝑦 ] = 𝛽 + 𝛽 𝑥 – It reflects reality: model
– Error explains deviations cannot control for
from expectations everything
In the above graph is 𝜀 positive or negative? 14
15
ECO220Y: Lecture 18
5
Six Assumptions of
Linear Regression Model
• Book gives only four: – ECO372H: Data Analysis
– One skipped b/c obvious and Applied
Econometrics in Practice
– Another skipped b/c
only required for a – ECO374H: Forecasting
causal interpretation and Time Series
Econometrics
– To minimize confusion,
list extra two as 5 & 6 – ECO375H: Applied
Econometrics I
• Econometrics addresses – ECO475H: Applied
substantial violations of Econometrics II
assumptions
16
Assumption #1
• Regression equation is linear in the error and
parameters; the variables (in boxes) are
linearly related to each other
=𝛼+𝛽 +𝜀
– Not assuming that what is in boxes is linear (so
long as no nonlinear functions of parameters or
nonlinear functions of the error)
• Example of a linear regression: 𝑦 = 𝛼 + 𝛽𝑥 + 𝜀
• Example of a linear regression: ln 𝑦 = 𝛽 + 𝛽 𝑥 + 𝜀
17
10
1
8
0
6
4 -1
2 -2
2 2.2 2.4 2.6 2.8 3 2 3 4 5 6 7
Weekly price, $1s Q_hat
18
ECO220Y: Lecture 18
6
Natural Log Transformations
Frozen Pizza, Denver, 1994-96
ln(Q)-hat = 4.095 + -2.773*ln(P) .4
n = 156, R2 = 0.631, s.e.(b) = 0.171
ln(Weekly Q, 10,000s)
e (residuals)
2.5 .2
2 0
1.5 -.2
1 -.4
.7 .8 .9 1 1.1 2 3 4 5 6 7
ln(Weekly price, $1s) ln(Q)_hat
19
Assumption #2
• No autocorrelation / no Assumption #2 Holds
No Autocorrelation
serial correlation:
-10-5 0 5 10
residual (e)
𝐶𝑂𝑉 𝜀 , 𝜀 = 0 if 𝑖 ≠ 𝑗
– Common problem in
time-series data 0 10 20 30
t
• E.g. daily stock prices,
monthly unemployment, Assumption #2 Violated
quarterly growth data Positive Autocorrelation
residual (e)
Assumption #3
Homoscedasticity
• Homoscedasticity:
-1 0 1 2 3
𝑉 𝜀 = 𝜎 , 𝑖 = 1, … , 𝑛
Y1
– “Equal variance
assumption”
0 2 4 6 8 10
– Error 𝜀 is just as “noisy” X1
for all values of x
Heteroscedasticity
– Violation is called
3
heteroscedasticity
2
Y2
– Common problem in
1
cross-sectional data
0
0 2 4 6 8 10
X2
21
ECO220Y: Lecture 18
7
Fix Assumption #1 issues before
checking Assumption #3
Frozen
Frozen Pizza,
Pizza, Denver,
Denver, 1994-96
1994-96 3
ln(Q)-hat
Q-hat ==18.122
4.095 ++ -2.773*ln(P)
-5.280*P .4
nn == 156,
156, R2
R2 == 0.592,
0.631, s.e.(b)
s.e.(b) == 0.353
0.171 2
10,000s)
e (residuals)
Q,10,000s
2.5
10 .2
1
8
2 0
0
6
Weekly Q,
ln(Weekly
1.5 -.2
-1
4
21 -2
-.4
2.7 2.2.8 2.4 .9 2.6 12.8 1.1
3 22 33 44 55 6 7
ln(Weekly
Weekly price,
price,$1s
$1s) ln(Q)_hat
Q_hat
Assumptions #4 & #5
Galton's Heights, n = 1078
80
• Galton’s data (Lec. 5)
Son, inches
75
– Assumptions 1-3 hold? 70
65
• Normality: 𝜀 is Normal
60
– 𝜀 is unobserved so 60 65 70 75
check 𝑒 = 𝑦 − 𝑦 Father, inches
𝐸 𝜀 = 0, 𝑖 = 1, … , 𝑛 .15
Density
Graphical Summary
𝑦 Would the elements in the population
(not shown) lie on the line?
𝐸[𝑦] = 𝛼 + 𝛽𝑥
𝐸[𝑦|𝑥 = 𝑥 ]
𝐸[𝑦|𝑥 = 𝑥 ] 𝑁(𝛼 + 𝛽𝑥 , 𝜎 )
𝐸[𝑦|𝑥 = 𝑥 ] 𝑁(𝛼 + 𝛽𝑥 , 𝜎 )
𝑁(𝛼 + 𝛽𝑥 , 𝜎 ) Is a reflection of
sampling error?
𝑥 𝑥 𝑥 𝑥
Assumptions #3, #4, and #5 combined: 𝜀 ~ 𝑁(0, 𝜎 ) 24
ECO220Y: Lecture 18
8
2017 ON Public Sector Waterloo, 2016 Salaries
400
Assumption #3 violated?
100
Assumption #4 violated? 0 .2 .4 .6 .8 1
Male
25
Assumption #6:
Exogeneity Assumption
• x uncorrelated w/ error: 𝐶𝑂𝑉 𝑥 , 𝜀 = 0
– Exogeneity: x variable(s) unrelated with error
• Dosage is exogenous: 𝑆𝑙𝑒𝑒𝑝 = 𝛼 + 𝛽𝑑𝑜𝑠𝑎𝑔𝑒 + 𝜀
• Experimental data can est. causal effect: 𝐸 𝑏 = 𝛽
– Endogeneity: x variable(s) related with error
• With observational data, lurking/unobserved/omitted/
confounding variables mean x and error are related
• Price of pizza is endogenous: 𝑄 = 𝛽 + 𝛽 𝑃 + 𝜀
• Endogeneity bias means: 𝐸[𝑏 ] ≠ 𝛽
In estimating 𝑆𝑎𝑙𝑎𝑟𝑦 = 𝛽 + 𝛽 𝑀𝑎𝑙𝑒 + 𝜀 with 𝑛 = 1,357
Waterloo employees, is 𝑀𝑎𝑙𝑒 endogenous? 26
“Short-Hand” Assumptions
1) Linear relationship between variables
(possibly non-linearly transformed)
2) No correlation amongst errors (no
autocorrelation for time-series data)
3) Homoscedasticity (single variance) of errors
4) Normally distributed errors
5) Constant included (error has mean 0)
6) No relationship between x and error
27
ECO220Y: Lecture 18
9