
ECMT1020 Notes

Sean Gong

July 10, 2020


Week 1: Summarising Univariate Data

1.1 Types of data


1.1.1 Classifying data
Dataset: records the values of variables on several cases

• Variable: Characteristic of interest

• Case: Unit for which we measure a variable

Data classification criteria:

• Variable number: univariate, bivariate, multivariate

• Variable type: quantitative, categorical

• Case type: cross-section, time series (esp. used in economics), panel data

Different statistical techniques are used for each data type.

1.1.2 Types of variables


Numerical data

• Continuous: Can be arbitrarily precise

• Discrete: Cannot take arbitrarily precise values

Categorical data

• Not naturally numbers

NOTE: Numbers can be used to represent categorical data

Downloaded by Farrell Wihandoyo (wihandoyofarrell@gmail.com)


lOMoARcPSD|23815106

4 CHAPTER 1. WEEK 1: SUMMARISING UNIVARIATE DATA

1.1.3 Types of cases


• Cross-section data: Each case is a separate 'individual' (e.g. 2019 GDP
data for all OECD countries)

• Time series data: Same ’individual’ at different points in time (e.g. Aus-
tralian GDP growth 2000-2009)

• Panel data: Combination of cross-section and time series; A characteristic


recorded for different individuals at different points in time (e.g. GDP
data for all OECD countries in 20th century)

1.1.4 Notation
Example: Suppose we have univariate data on x (e.g. x is annual GDP growth)

• Cross section: n data points

– Each case is labelled xi for i = 1, 2, 3, ..., n


– Usually arbitrary (e.g. x1 = Albania, x2 = Afghanistan, etc.)

• Time series: t data points

– Each case is labelled xt


– (e.g. x1 is GDP growth for 1970, x2 for 1971)

• Panel data: n × t data points

– Data points labelled xnt

Multivariate cases

• Bivariate: effect of explanatory variable x on response variable y (e.g.
x = education level, y = annual income)

• Multivariate: effect of explanatory variables x1 , x2 , ... on y (e.g. x1 =
Chinese GDP growth, x2 = Australian inflation, y = Australian unemployment rate)

1.2 Summarising data


1.2.1 Statistics Recap
Statistics: using data to understand a parameter we cannot observe
Statistical model: using data to understand a population we cannot observe (e.g. every human being on Earth)
Statistics summary:

1. Assume dataset is a sample of the population

2. Calculate estimate of population parameter from sample

• Convention: Greek letters for population parameters, Latin letters


for sample estimates


3. Inference to say something about the estimated parameter (e.g. confi-


dence intervals, hypothesis tests, etc.)

Data analysis summary:

1. Summary: Summary statistics and graphs to illuminate essential features
of the data

2. Inference: What do these numbers tell about the parameters we are trying
to estimate?

3. Interpretation: What is the economic meaning?

1.2.2 Summary statistics


4 summary statistics - measures of:

1. Central tendency (e.g. average, median)

2. Dispersion (e.g. SD)

3. Asymmetry (e.g. skew)

4. Kurtosis (e.g. propensity for outliers / fat tails)

Central tendency
Sample mean
x̄ = (1/n) Σ_{i=1}^n x_i    (1.1)

Stata: mean [varname], or tabstat [varname], stat(mean)

Sample median Median: Middle observation

• Advantage: robust to outliers

• Disadvantage: insensitive to the magnitude of observations (uses only the middle value)

Stata: summarize [varname], detail, or tabstat [varname], stat(median)

Sample mode Mode: Most frequently occurring observation

• Not very useful for continuous data; can be made artificially discrete (by
grouping)

Sample midrange Midrange: Average of min and max observations

• Very sensitive to outliers; essentially useless

Dispersion
Quantiles Quartiles: special percentiles (the 25th, 50th and 75th percentiles), used for the IQR below


Sample variance
s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)²    (1.2)

NOTE: n - 1 is the degrees of freedom; not n because we are using sample mean
x̄ not population mean µ

• Degrees of freedom: number of values involved in calculations that have


the freedom to vary

• Total number of observations - number of independent constraints on ob-


servations

Standard deviation

s = √(s²)    (1.3)
68-95-99.7 rule: for roughly bell-shaped data, about 68%, 95% and 99.7% of observations lie within 1, 2 and 3 SDs of the mean.

Coefficient of variation
cv = s / x̄    (1.4)

• Advantage: unit-free ⇒ can be compared across different variables

• Disadvantage: nobody uses it

Range
range = max − min    (1.5)

Interquartile range

IQR = 3rd quartile − 1st quartile    (1.6)

Symmetry
Symmetry: Similarity when reflected about the median
Skew(x) = (1/n) Σ_{i=1}^n ((x_i − x̄)/s)³    (1.7)

Kurtosis
Kurtosis: fatness of the tails; how much frequency is distributed to the tails
Kurt(x) = (1/n) Σ_{i=1}^n ((x_i − x̄)/s)⁴    (1.8)

x normally distributed ⇒ Kurt(x) = 3

• Excess kurtosis = Kurt(x) − 3
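As a quick numerical illustration of equations (1.1), (1.2), (1.7) and (1.8), here is a minimal Python sketch (the course itself uses Stata; the data vector is made up purely for illustration):

```python
import numpy as np

x = np.array([2.0, 3.5, 4.0, 4.5, 5.0, 6.0, 9.5])  # made-up sample

n = x.size
xbar = x.mean()                          # equation (1.1)
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # equation (1.2): n-1 degrees of freedom
s = np.sqrt(s2)                          # equation (1.3)
cv = s / xbar                            # equation (1.4)
skew = (((x - xbar) / s) ** 3).mean()    # equation (1.7): uses 1/n, not 1/(n-1)
kurt = (((x - xbar) / s) ** 4).mean()    # equation (1.8); excess kurtosis = kurt - 3

print(xbar, s, cv, skew, kurt - 3)
```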


1.2.3 Graphs
Box plot

Box plot indicates:

• Minimum

• Maximum

• Quartiles (incl. median)

• Skewness

Note: Outliers lie above upper quartile + 1.5 × IQR (or below lower quartile − 1.5 × IQR)

Histogram

Histogram: Shows how often certain values occur

• Primarily useful for cross-section data

• Not as useful for time-series // no info on when values occurred

• y-axis: frequency or percentages

– Latter more useful with large datasets


Histograms for continuous variables:


• Each value has frequency of 1 ⇒ directly using a histogram is not useful ⇒
group observations into bins (usually about √n bins) ⇒ makes the data discrete
• Alternatives:
– Smoothed histogram
– Kernel density estimate

Line graphs
Line graph: plots observations against observation number
• Shows how values changing over time
• Only useful where x-axis is variable with natural ordering
Stata: tsline

Categorical data
Options:
• Frequency table
• Bar chart

• Pie chart


Week 2: Probability Recap, Inference on Means

2.1 Probability recap


2.1.1 Experiments and random variables
Experiment: An operation whose outcome cannot be predicted with certainty
Random variable: Variable whose value depends on the outcome of the experi-
ment

• E.g. Coin flipping: Random variable is no. of heads

• Notation:

– Random variable: X (Upper case)


– Specific value/outcome: x (lower case)
∗ One realisation of X is x
– Probability Mass Function (PMF) of X: Pr[X = x]
i.e. probability of X taking value x
– Cumulative Distribution Function: Pr[X ≤ x]

Case study: Flipping a coin 2 times, random variable = no. of heads


• 4 outcomes: TT, TH, HT, HH; each with probability 1/4

• Values of X: 0, 1, 1, 2 respectively
PMF:
– Pr[X = 0] = 1/4
– Pr[X = 1] = 1/4 + 1/4 = 1/2
– Pr[X = 2] = 1/4

CDF:

– Pr[X ≤ 0] = 1/4
– Pr[X ≤ 1] = 1/4 + 1/4 + 1/4 = 3/4
– Pr[X ≤ 2] = 3/4 + 1/4 = 1


2.1.2 Discrete vs. continuous


Random variables can be:

• Discrete: X takes on discrete values (e.g. coin flipping)

• Continuous: X takes on an uncountable no. of values

CDF: Generalises nicely (i.e. applicable to both discrete/continuous)


PMF: Does not generalise // for continuous X, Pr[X = x] = 0 for every single value (uncountably many values)
⇒ Probability Density Function (PDF): derivative of the CDF

• If normal distribution, PDF is bell curve


2.1.3 Expected value


Expected value: Weighted average of all x that X can take, weights determined
by probability
E[X] = Σ_{i=1}^n x_i · Pr[X = x_i]    (2.1)

E.g. Coin flipping example:

E[X] = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1    (2.2)
⇒ EV = 1 head flip
This is a population quantity ⇒ E[X] is the population mean of X, denoted µ

2.1.4 Population variance


Sample variance: average of (xi − x̄)2 (divide by n-1)
Population variance: average of (X − µ)2
σ² = Var[X] = E[(X − µ)²] = Σ_{i=1}^n (x_i − µ)² · Pr[X = x_i]    (2.3)

NOTE: Similarity between EV and variance


EV: weighted average of all values X can take.
Variance: weighted average of all squared differences between the random variable and its EV.
Coin flipping example:

σ² = (0 − 1)² · (1/4) + (1 − 1)² · (1/2) + (2 − 1)² · (1/4) = 1/2    (2.4)

Population SD = σ = √(1/2) = 1/√2    (2.5)

Population standard deviation of X: σ = √(σ²)
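A minimal Python sketch checking equations (2.1)-(2.5) for the two-flip coin example, using the values and probabilities stated above:

```python
import numpy as np

x = np.array([0, 1, 2])            # values of X (number of heads)
p = np.array([0.25, 0.5, 0.25])    # PMF from the two-flip example

ev = (x * p).sum()                 # E[X] = 1, equation (2.2)
var = ((x - ev) ** 2 * p).sum()    # sigma^2 = 1/2, equation (2.4)
sd = np.sqrt(var)                  # sigma = 1/sqrt(2), equation (2.5)
cdf = np.cumsum(p)                 # Pr[X <= x] = 0.25, 0.75, 1

print(ev, var, sd, cdf)
```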

NOTE: Skewness and kurtosis have population analogues (not in ECMT1020)

2.1.5 Linear transformations of random variables


Linear transformation: Finding Y = a + bX where we know the mean and
variance of X
E.g. Celsius → Fahrenheit

• Mean of Y = a + bµ

• Variance of Y = b2 σ 2

• SD of Y = √(b²σ²) = |b| · σ
NOTE: b ∈ R, σ = SD ≥ 0


2.1.6 Standardisation
(X − µ)/σ always has SD 1 and mean 0 // linear transformation

• Y = (X − µ)/σ, i.e. Y = a + bX

• a = −µ/σ, b = 1/σ

• Mean of Y = a + bµ = −µ/σ + µ/σ = 0

• SD of Y = |b| · σ = (1/σ) · σ = 1
Y is the standardised form of X

2.1.7 Linear combination


Linear combination: aX + bY where X, Y are random variables, a, b are non-
random constants
E[aX + bY] = a · E[X] + b · E[Y]    (2.6)
X and Y independent (IMPORTANT: make sure no correlation!)⇒
V ar[aX + bY ] = a2 V ar[X] + b2 V ar[Y ] (2.7)
We obtain the above result from the fact that
Var[aX + bY] = a²Var[X] + 2ab·Cov(X, Y) + b²Var[Y]    (2.8)

2.2 Sampling distributions


Sample statistics should give information about population statistics

x̄ ⇒ µ, s2 ⇒ σ 2 (2.9)
KEY: Sample statistics are themselves random variables (the numbers we compute are realisations of those random variables)

Sample statistics have a distribution

2.2.1 Observations are realisations of random variables


Identically distributed: Distribution of sample random variables same as
distribution of whole population
Independent: Two observations are independent if knowing one does not give
information about the other

2.2.2 The sample mean as a random variable


Sample mean: x̄ = (1/n)(x1 + x2 + ... + xn)

It makes sense to think of it as: X̄ = (1/n)(X1 + X2 + ... + Xn)

x̄ is a realisation of X̄

Study distribution of X̄ → quantify what x̄ tells us about µ


2.2.3 The distribution of X̄


Assumptions:

• X1 , X2 , ..., Xn are independent and identically distributed (iid)

• Each Xi is selected from a distribution with mean µ and variance σ², so the sum X1 + X2 + ... + Xn is (approximately) normally distributed

– Central Limit Theorem: Sum of random variables will be nor-


mally distributed, even if the variables themselves are not

We want to learn about µ from the sample, without knowing µ or σ²

Expected value of X̄
Given that:
• X̄ = (1/n)X1 + (1/n)X2 + ... + (1/n)Xn

• E[X1 ] = µ, E[X2 ] = µ, ..., E[Xn ] = µ

Then applying linear combinations:


E[X̄] = (1/n)·µ + (1/n)·µ + ... + (1/n)·µ = n · (1/n) · µ = µ    (2.10)
We say that the sample mean X̄ is an unbiased estimator of the population
mean µ

Variance of X̄

Var[X̄] = Var[(1/n)X1 + (1/n)X2 + ... + (1/n)Xn] = (1/n²)·σ² + (1/n²)·σ² + ... + (1/n²)·σ² = n · (1/n²) · σ² = σ²/n    (2.11)

SD of X̄ is σ/√n

• Variance reduces as sample size grows

2.2.4 Statistical properties of X̄


Sample mean X̄ properties:

• X̄ is unbiased estimator: E[X̄] = µ

• X̄ is consistent estimator: distribution more concentrated around µ as


sample size grows
For every constant c > 0:

lim_{n→∞} Pr[µ − c ≤ X̄_n ≤ µ + c] = 1    (2.12)

• X̄ has the minimum variance of all unbiased estimators of µ
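A small simulation sketch of these properties (illustrative only; the population here is an arbitrary non-normal distribution, chosen to show the CLT at work):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0        # exponential(1) population: mean 1, SD 1
reps = 20000

for n in (5, 50, 500):
    xbars = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    # E[Xbar] should be close to mu; SD of Xbar should be close to sigma/sqrt(n)
    print(n, xbars.mean(), xbars.std(ddof=1), sigma / np.sqrt(n))
```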


2.3 Inference on the mean


2.3.1 The z statistic
X̄ ∼ N(µ, σ²/n): X̄ has a normal distribution with mean µ and variance σ²/n

The z statistic: difference between sample mean and population mean as pro-
portion of SD; how many SDs away from the population mean?

z = (X̄ − µ) / (σ/√n)    (2.13)

NOTE: Z ∼ N (0, 1)

2.3.2 Standard error of the sample mean


s2 is realisation of random variable S 2 with expected value σ 2

We use s to estimate σ

SD of X̄ estimated as s/√n

s/√n is the standard error of the sample mean

2.3.3 The t statistic


t = (X̄ − µ) / (S/√n)    (2.14)

T does not have a normal distribution. Instead, the t statistic (approximately) follows:

• a t distribution with n−1 degrees of freedom
• which approaches the normal distribution as n increases

The t statistic is exactly t_{n−1} distributed when X ∼ N(µ, σ²); the approximation also improves as n → ∞.


2.3.4 Confidence intervals
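The material for this subsection is graphical in the original notes. As a minimal sketch, a 100(1−α)% confidence interval for µ has the form x̄ ± t_{n−1,α/2} · s/√n (the same structure as equation (3.2) later); the Python example below uses made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([38.1, 41.3, 40.2, 44.5, 39.0, 42.7, 40.9, 37.6])  # made-up data
n, xbar, s = x.size, x.mean(), x.std(ddof=1)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # critical value t_{n-1, alpha/2}
se = s / np.sqrt(n)                            # standard error of the sample mean
ci = (xbar - tcrit * se, xbar + tcrit * se)    # 95% confidence interval for mu
print(ci)
```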


2.3.5 Hypothesis testing


Steps:

• Null hypothesis H0 : claim to be rejected - generally µ = µ∗

Alternative hypothesis Ha : alternative to claim - generally µ ≠ µ∗

• Basic idea: Assuming H0 is true, how unlikely is our sample estimate?

• Significance level α: cut-off point between extremely unlikely and not


unlikely

The p value

p-value: probability of obtaining a sample estimate at least as extreme as ours, if H0 is true

• Reject H0 if p-value < α


Critical values
Critical region: Values of t such that H0 should be rejected

• Thus probability of being in critical region is α

• Critical region: |t| > tn−1,α/2

• Critical value: boundary value tn−1,α/2

2.3.6 Type I and Type II Errors


Type I Error: Reject true H0
Type II Error: Do not reject false H0

We can control for these errors by adjusting α

Trade-off: decreasing Pr[Type I Error] increases Pr[Type II Error], vice versa.


This occurs when we increase α. The inverse of this is also true.

NOTE: Because we can adjust α, the probability of Type I and II Er-


ror is completely under the control of the statistician.

α is the maximum probability of Type I error.

We also have, in Colin Cameron’s textbook, the following definitions:

• Test size: Pr[Type I error] = α


• Test power: 1 - Pr[Type II error]


Week 3: More hypothesis tests, Data transformations

3.1 Further hypothesis testing


3.1.1 2-sided vs. 1-sided tests
2-sided tests: do not care about direction of effect (larger/smaller)
• Usually used to answer question ”does X have any influence on Y?”
• Reject H0 if sample mean much larger OR smaller
1-sided tests: care about direction of effect
• Reject H0 only if smaller OR larger

3.1.2 1-sided hypotheses


2 options for 1-sided test:
1. H0 : µ ≤ µ∗ vs Ha : µ > µ∗
2. H0 : µ ≥ µ∗ vs Ha : µ < µ∗
2-sided example: claim = ”Annual average salary is $40000”
1. H0 : µ = 40000
2. Ha : µ ≠ 40000
IMPORTANT: Here, rejecting H0 means ”claim is wrong” (i.e. H0 is the
claim)

1-sided example: claim = ”Annual average salary is above $40000”


1. H0 : µ ≤ 40000
2. Ha : µ > 40000
IMPORTANT: Here, rejecting H0 means ”claim is right” (i.e. Ha is the claim)
COMMENT: Why is it that H0 serves different purposes for 1-sided
and 2-sided tests?


3.1.3 p-values and critical regions for 1-sided tests


Consider:

1. H0 : µ ≤ 40000

2. Ha : µ > 40000

2 approaches (same as 2-sided tests):

1. p-value approach

• More unlikely x̄ = more consistent with Ha

• Likelihood of x̄ measured by p-value from t statistic

• In this case, the p-value is Pr[Tn−1 ≥ t]; Pr[Tn−1 ≤ t] in the opposite case

• 2-sided: Pr[|Tn−1 | ≥ |t|]

2. Critical value approach

• 2-sided: Reject for very large AND small values of t

• 2-sided critical region: |t| > Tn−1,α/2

• 1-sided: Reject for very large OR small values of t

• 1-sided critical region: t > Tn−1,α OR t < −Tn−1,α

– The inequality sign is the same as in Ha


3.1.4 t-tests in Stata


Stata: ttest reports both 1-sided AND 2-sided

i.e. ttest earnings==40000 returns p-values for 3 tests:


1. H0 : µ = 40000 vs Ha : µ ≠ 40000
2. H0 : µ ≤ 40000 vs Ha : µ > 40000
3. H0 : µ ≥ 40000 vs Ha : µ < 40000
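A sketch of how the three p-values reported by ttest can be reproduced from summary statistics (the sample numbers below are hypothetical):

```python
import numpy as np
from scipy import stats

# hypothetical summary statistics for 'earnings'
n, xbar, s = 200, 41500.0, 12000.0
mu0 = 40000.0

t = (xbar - mu0) / (s / np.sqrt(n))
df = n - 1

p_two   = 2 * stats.t.sf(abs(t), df)   # H0: mu = 40000 vs Ha: mu != 40000
p_right = stats.t.sf(t, df)            # H0: mu <= 40000 vs Ha: mu > 40000
p_left  = stats.t.cdf(t, df)           # H0: mu >= 40000 vs Ha: mu < 40000
print(t, p_two, p_right, p_left)
```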

3.1.5 Hypotheses concerning other parameters


To test other parameters (e.g. diff. between 2 means, proportion) we can either:
1. Follow same procedure of t test for µ
2. Use Central Limit Theorem:
t = (estimator − parameter value under H0) / (SE of estimator)    (3.1)
which approx. follows a t distribution, where df = n − (number of estimated parameters, excl. the SE)

Confidence intervals and hypothesis tests


CLT means normal procedure for confidence intervals

100(1-α)% confidence interval for a parameter:


estimate ± tdf,α/2 × SE (3.2)

Hypothesis tests still the same: compute the t statistic, reject H0 if p-value < α


Difference in means
Example: Male average annual salary higher than female average annual salary?

We have:

• 2 independent random variables

1. Male ave. ann. salary: X1 ∼ N (µ1 , σ12 )


2. Female ave. ann. salary: X2 ∼ N (µ2 , σ22 )

• Hypotheses

– H0 : µ 1 − µ 2 ≤ 0
– Ha : µ 1 − µ 2 > 0

• Independent random samples

– x1,1 , x1,2 , x1,3 , ..., x1,n1


– x2,1 , x2,2 , x2,3 , ..., x2,n2

t statistic:

• µ1 − µ2 estimated by: x̄1 − x̄2


– se(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)

• n1 + n2 observations, 2 estimated means HENCE df = n1 + n2 − 2

Hence:
t = [(X̄1 − X̄2) − (µ1 − µ2)] / √(s1²/n1 + s2²/n2)    (3.3)
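A sketch of equation (3.3) with hypothetical group summary statistics (df approximated by n1 + n2 − 2, as in the notes):

```python
import numpy as np
from scipy import stats

# hypothetical male / female salary samples (summary statistics)
n1, xbar1, s1 = 120, 62000.0, 15000.0
n2, xbar2, s2 = 110, 58000.0, 14000.0

se_diff = np.sqrt(s1**2 / n1 + s2**2 / n2)
t = (xbar1 - xbar2) / se_diff          # H0: mu1 - mu2 <= 0 vs Ha: mu1 - mu2 > 0
df = n1 + n2 - 2
p_one_sided = stats.t.sf(t, df)        # reject H0 for large t
print(t, p_one_sided)
```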

Example:


Proportions
Example: Proportion of loans in default

A proportion is a mean: mean of a variable which can take on the value


of either:

• 1: if the thing we wish to measure occurs (defaulted loan)

• 0: if the thing we wish to measure doesn’t occur (no default)

It is a mean, so the usual t test should work

Let population proportion be p, sample proportion be p̂:

• H0 : p = p* vs Ha : p ≠ p*

• SE of p̂: √(p*(1 − p*)/n)

Variance is known under H0 ⇒ we can use the normal distribution instead


of t distribution

NOTE: hard to estimate proportions close to 1 or 0 ⇒ we need:

np, n(1 − p) > 10 (3.4)

3.2 Univariate data transformation


3.2.1 Logarithms recap

A logarithm is an exponent:

• E.g. 103 = 1000, log10 1000 = 3

Natural logarithm uses constant e as a convenient base


Exponential function ex often written as exp(x)


3.2.2 Linearising exponential growth


Linear approximation to the natural logarithm:

Usefulness of logarithms/exponentials

• Many economic variables are exponential

• Our models are linear

• Logarithms/exponentials allow transformations between the two

Example: a0 dollars in an account with an annual interest rate of 100r%

at = a0 (1 + r)t ⇒ ln(at ) = ln(a0 ) + tln(1 + r) (3.5)

The exponential function a_t has been transformed into ln(a_t), which is linear in t

Example: Australian GDP transformation

3.2.3 Proportionate changes


Logarithms are useful in approximating proportionate change:

• Change in variable: ∆x = xt+1 − xt


• Proportionate change: ∆x/x_t

• Logarithmic transformation: ∆ln(x) ≈ ∆x/x_t
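A quick numerical check that ∆ln(x) tracks the proportionate change ∆x/x_t well for small changes but not for large ones (the numbers are arbitrary):

```python
import numpy as np

x_t = 100.0
for x_next in (101.0, 105.0, 150.0):           # +1%, +5%, +50%
    prop_change = (x_next - x_t) / x_t
    log_change = np.log(x_next) - np.log(x_t)  # Delta ln(x)
    print(prop_change, log_change)
# 0.01 vs 0.00995..., 0.05 vs 0.0488..., 0.50 vs 0.405...
```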

3.2.4 Eliminating right skewness


Logarithms can reduce skewness; most economic variables are right-skewed:
• ln(x) highly steep for small x, flat for large x
• THUS inflates small x, reduces large x
• Brings closer to normal distribution

COMMENT: Will analyzing transformed data lead to biases; is it


still reliable?

3.2.5 Other useful transformations


• Adjusting by price level to get real values
– Given a base year B:
realGDP_t = nomGDP_t × (price_B / price_t)    (3.6)

• Adjusting by size of population to get per capita values


– Given the population:
perCapRealGDP_t = realGDP_t / population_t    (3.7)

3.2.6 Moving averages


Moving average (MA) smooths fluctuations and highlights long-term pat-
terns
1. simple MA: averages over previous recent observations
x̃_t = (1/5) · (x_{t−4} + x_{t−3} + x_{t−2} + x_{t−1} + x_t)    (3.8)


2. centred MA: averages over surrounding observations


x̃_t = (1/5) · (x_{t−2} + x_{t−1} + x_t + x_{t+1} + x_{t+2})    (3.9)

Simple MA more popular // can be calculated at time t (needs no future observations)

NOTE: 5 is not a fixed constant
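A sketch of the 5-period moving average window from equations (3.8)-(3.9), applied to an arbitrary series:

```python
import numpy as np

x = np.array([4.0, 6.0, 5.0, 7.0, 8.0, 6.0, 9.0, 10.0, 8.0, 11.0])

window = np.ones(5) / 5
ma = np.convolve(x, window, mode="valid")   # average of each block of 5 consecutive values

# eq. (3.8): the simple MA at time t averages x[t-4..t]   -> ma[j] is the simple MA for t = j + 4
# eq. (3.9): the centred MA at time t averages x[t-2..t+2] -> ma[j] is the centred MA for t = j + 2
print(ma)
```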

Deseasonalising data (12-month MA):

3.2.7 Growth rates


Growth rate is a proportionate change: (x_t − x_{t−1}) / x_{t−1}

2 reporting options for quarterly/monthly data:


1. Annualised growth rate: extrapolates current quarter growth to a whole year:
4 × (x_t − x_{t−1}) / x_{t−1}    (3.10)

2. Year-on-year growth rate: compares the current quarter to the same quarter last year:
(x_t − x_{t−4}) / x_{t−4}    (3.11)


Week 4: Bivariate data, least squares regression

4.1 Notation
Bivariate data: 2 variables

• We have (x1 , y1 ), (x2 , y2 ), ..., (xn , yn )

• We assume X influences Y

4.2 Cross Tables


Cross Table: summarises relationship between 2 discrete variables

• Cross tables not useful for continuous variables

It is common to include:

• Row percentages summing to 100% for each row

• Column percentages summing to 100% for each column

(Example tables: 1. Row percentages; 2. Column percentages)

4.3 Scatter Plots


Scatter Plot: summarises relationship between 2 continuous variables


4.4 Bivariate distributions


Univariate case: We need to know Pr[X = x]
Bivariate case: We need to know Pr[X = x, Y = y]
• NOTE: Not just Pr[X = x], Pr[Y = y]
Example: Gender and voting behaviour

• Vertical total of rows: Shows gender distribution irrespective of political


affiliation = Pr[X = x] = marginal distribution of X
• Horizontal total of columns: Shows voting distribution irrespective of gen-
der = Pr[Y = y] = marginal distribution of Y
Conditional probabilities: Probability of Democrat given female and vice
versa?
Pr[Democrat|female] = Pr[female, Democrat] / Pr[female] = 0.22/0.51 ≈ 0.431    (4.1)
Similarly:
Pr[female|Democrat] = Pr[female, Democrat] / Pr[Democrat] = 0.22/0.39 ≈ 0.564    (4.2)


4.5 Bayes’ Rule


Generalising the previous results:

Pr[X = x|Y = y] = Pr[X = x, Y = y] / Pr[Y = y]    (4.3)

Also:
Pr[Y = y|X = x] = Pr[X = x, Y = y] / Pr[X = x]    (4.4)
Combining, we get Bayes’ Rule:

Pr[Y = y|X = x] / Pr[Y = y] = Pr[X = x|Y = y] / Pr[X = x]    (4.5)
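A sketch of equations (4.1)-(4.5) using the joint and marginal probabilities quoted above (Pr[female, Democrat] = 0.22, Pr[female] = 0.51, Pr[Democrat] = 0.39):

```python
p_joint = 0.22       # Pr[female, Democrat]
p_female = 0.51      # marginal Pr[female]
p_democrat = 0.39    # marginal Pr[Democrat]

p_dem_given_fem = p_joint / p_female      # eq. (4.1), ~0.431
p_fem_given_dem = p_joint / p_democrat    # eq. (4.2), ~0.564

# Bayes' rule, eq. (4.5): both sides equal the same ratio
lhs = p_dem_given_fem / p_democrat
rhs = p_fem_given_dem / p_female
print(p_dem_given_fem, p_fem_given_dem, lhs, rhs)
```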

4.6 Independence
Dependence: Knowing the value of one variable changes the probability distribution of the other variable.
Independence: Knowing the value of one variable does not change the probability distribution of the other variable, i.e.:

P r[X = x|Y = y] = P r[X = x] (4.6)

And (from equations 4.3, 4.4):

P r[X = x, Y = y] = P r[X = x] · P r[Y = y] (4.7)

NOTE: It is useful to think of it as flipping a coin

4.7 Conditional means and variances


Conditional distributions are univariate distributions!

HENCE: Conditional means and conditional variances are computed in


same way as univariate means and variances


4.8 Covariance
Covariance: scaled measure of linear dependence
σXY = E[(X − µx )(Y − µy )] (4.8)
• σXY > 0: X and Y either both large or both small
• σXY < 0: One large, other small
• X and Y independent ⇒ σXY = 0. NOTICE THIS IS A ONE-WAY STATEMENT (σXY = 0 does not imply independence)
NOTE: Var[X] = σ_X² = σ_XX
NOTE: Covariance only measures linear dependence (σXY ≈ 0 is possible even when there is a non-linear relationship)
NOTE: Covariance is NOT scale-free (e.g. measuring X in metres instead of km multiplies σXY by 1000)

Sample covariance:
s_XY = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)    (4.9)

4.9 Correlation
Correlation: scale-free measure of linear dependence
ρ_XY = σ_XY / (σ_X · σ_Y)    (4.10)
Sample correlation:
r_XY = s_XY / (s_X · s_Y)    (4.11)
Correlation: the no. of SDs that y changes by when x changes by 1 SD


Due to formula of correlation, points in quadrants have specific signs:

• rXY > 0: More in positive quadrants than negative quadrants

• rXY < 0: More in negative quadrants than positive quadrants

4.10 Regression
Regression: Finding line of best fit for scatter plot

i.e. find values b1 , b2 such that

ŷi = b1 + b2 xi (4.12)

is as close to yi for all i = 1, 2, ..., n as possible

• y is dependent variable, x is independent variable

• ŷ are fitted values (y values when subbed into straight line formula)

• b1 is intercept, b2 is slope

Residual: Distance between line and value

ei = yi − ŷi (4.13)

We want to minimise the residual when constructing our regression line

4.11 Least Squares Regression


How to make ei small:


1. Minimise Σ_{i=1}^n e_i: Does not work as positive and negative residuals
cancel out

2. Minimise Σ_{i=1}^n |e_i|: Mathematically complicated and has poor statistical properties

3. Minimise Σ_{i=1}^n e_i²: THIS WORKS; Least squares regression

We are trying to solve:


min_{b1,b2} Σ_{i=1}^n e_i² = min_{b1,b2} Σ_{i=1}^n (y_i − ŷ_i)² = min_{b1,b2} Σ_{i=1}^n (y_i − b_1 − b_2 x_i)²    (4.14)

Using calculus, the following optimisations are solved:


b_2 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,    b_1 = ȳ − b_2 x̄    (4.15)

Note that for b_2, if the numerator and denominator are each multiplied by 1/(n−1), the numerator is the sample covariance and the denominator is the sample variance. Hence, we get:

b_2 = s_xy / s_x² = (s_xy · s_y) / (s_x · s_x · s_y) = r_xy · (s_y / s_x)    (4.16)
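A sketch computing b_2 and b_1 directly from equations (4.15)-(4.16) and cross-checking against a library fit; the data are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])   # made-up bivariate data

xbar, ybar = x.mean(), y.mean()
b2 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()   # eq. (4.15)
b1 = ybar - b2 * xbar

# equivalent forms, eq. (4.16): b2 = s_xy / s_x^2 = r_xy * s_y / s_x
sxy = np.cov(x, y, ddof=1)[0, 1]
b2_alt = sxy / x.var(ddof=1)
r = np.corrcoef(x, y)[0, 1]
b2_alt2 = r * y.std(ddof=1) / x.std(ddof=1)

print(b1, b2, b2_alt, b2_alt2)
print(np.polyfit(x, y, 1))   # cross-check: returns [slope, intercept]
```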

4.12 The meaning of b1 and b2


b1 : The intercept

• The fitted value of y for x = 0

• NOTE: Sometimes doesn’t make economic sense

b2 : The slope

• Predicted change in y given a unit change in x


• b_2 = dŷ/dx

4.13 Point forecasts


Sub in x (independent variable) values to get predicted y (dependent variable)
values

4.14 Correlation vs. Causation


We usually assume X causes changes in Y , but this may not be true


4.15 Standard error of regression


We want to see how ’good’ the regression line is. We look at the variance of ei :
(1/(n−1)) Σ_{i=1}^n (e_i − ē)²    (4.17)

Obviously, ē = 0. Also, we have estimated 2 coefficients b_1, b_2, so we actually
divide by n − 2, not n − 1. Hence, the standard error of regression (s_e):
s_e² = (1/(n−2)) Σ_{i=1}^n e_i²    (4.18)

Also known as root mean squared error

4.16 Sums of squares


3 measures to understand relationship between X and Y:

1. Total sum of squares (TSS): total variation in Y


TSS = Σ_{i=1}^n (y_i − ȳ)²,    TSS/df_TSS = Var[y]    (4.19)

2. Explained sum of squares (ESS): Variation explained by model


ESS = Σ_{i=1}^n (ŷ_i − ȳ)²,    ESS/df_ESS = Var[ŷ]    (4.20)

3. Residual sum of squares (RSS): Variation not explained by model


RSS = Σ_{i=1}^n (y_i − ŷ_i)²,    RSS/df_RSS = Var[e] = s_e²    (4.21)

NOTE: T SS = ESS + RSS

4.17 Coefficient of determination


Coefficient of determination (R2 ): Proportion of total variation explained
by the model
R² = ESS/TSS,    R² ∈ [0, 1]    (4.22)
Higher R2 means model fits better and vice versa. It turns out that:

R² = r_xy² = r_yŷ²    (4.23)


4.18 Statistical assumptions


4.18.1 Assumption 1: Linear model
Assume that, for all i:
yi = β 1 + β 2 x i + u i (4.24)
where β1 , β2 are fixed population parameters, ui is an error term

That is, Y depends linearly on X in the population


NOTE: We don’t say anything about X, only Y’s dependence on X

4.18.2 Assumption 2: Mean independence


Assume that, for all i:
E[ui |xi ] = 0 (4.25)
That is, for any particular xi , ui = 0 on average
NOTE: This may be false if there is a pattern in ui values

4.18.3 Implication of Assumptions 1 and 2


Implication: The expected value of yi given xi is β1 + β2 xi , i.e.:

E[yi |xi ] = β1 + β2 xi (4.26)

The population regression line is just E[y|x]

4.18.4 Assumption 3: Homoskedasticity


Assume that, for all i:
V ar[ui |xi ] = σu 2 (4.27)


That is, the variance of the error term as x changes is constant. Otherwise,
heteroskedastic, e.g.:

4.18.5 Assumption 4: Independence

Assume that, for all i, j where i ≠ j:

ui , uj independent (4.28)

That is, where yi is relative to the regression line does not affect where yi+1 will
be relative to the line.

This is often not true for time-series data (e.g. if unemployment is high this
year, it will likely also be high next year):


4.18.6 Assumption 5: Sample assumptions


Assume that:

1. Sample variance of x is non-zero

2. There are more than 2 data points

4.18.7 Implication of assumptions


From Assumptions 1 & 2: the least squares estimators are unbiased, i.e.:
E[b1] = β1 ,  E[b2] = β2    (4.29)
From Assumptions 3 & 4:

We need to estimate σu 2 .
E[se 2 ] = σu 2 . Hence, replace σu 2 with se 2 .


Coefficient standard errors se(b1), se(b2): square roots of Var[b1], Var[b2] when
we replace σ_u² with s_e²

4.19 Hypothesis testing


4.19.1 Normality and t statistics
Assume ui is normally distributed, so b1 , b2 are normally distributed too. Al-
ternatively, rely on CLT.

Similarly to the univariate case, the t statistic for b2 is:

(b2 − β2) / se(b2) ∼ t_{n−2}    (4.30)

4.19.2 Significance tests


Generally, we have:
H0 : β2 = 0,  H1 : β2 ≠ 0    (4.31)
where rejecting H0 means X explains some variation in Y. We call this test the
coefficient significance test.

4.19.3 Confidence intervals


Confidence interval for b2 :
b2 ± t_{n−2,α/2} · se(b2)    (4.32)

4.20 Optimality properties


The estimators b1 , b2 are the best linear unbiased estimators:

• They are unbiased


• Variance approaches 0 as n → ∞, so consistent
• By Assumptions 1-4: b1 , b2 have smallest variance among all possible un-
biased estimators that are linear functions of y

We call this result the Gauss-Markov theorem: our OLS estimators are
BLUE (Best Linear Unbiased Estimators).


Week 5: Inference for bivariate regression

5.1 Robust Standard Errors


5.1.1 Relaxing the assumptions
Some results require Assumptions 3 and 4. If either assumption is wrong, the
variance of the estimators will also be wrong. Our default standard errors are
no longer applicable. To perform inference on b2 , we must use robust stan-
dard errors.

3 cases where robust standard errors must be used:


1. Robustness to heteroskedasticity
2. Robustness to autocorrelation
3. Robustness to clustering

5.1.2 Heteroskedasticity
Heteroskedasticity: Drop Assumption 3
• V ar[ui |xi ] is not constant at σu 2 for each i.
• The formula for se(b2 ) is incorrect as it depends on σu 2 , which no longer
exists.

Adjusting standard errors (NOTE: not examinable):


• White (1980)
• Even if Assumption 3 is false, E[ui |xi ] = 0
• Hence, estimate Var[ui |xi ] = E[ui²|xi ] using the individual ei² rather than the
common se²
Heteroskedasticity-robust standard errors
• Robust standard errors larger → t statistics smaller, p-values larger →
rejecting H0 harder
• Stata: regress y x, robust adjusts standard errors
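A sketch (like the adjustment above, not examinable) of White-style (HC0) robust standard errors computed by hand next to the default formula; the data are simulated with heteroskedastic errors purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)
u = rng.normal(0, 0.5 + 0.3 * x)        # error variance grows with x (heteroskedastic)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])    # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
e = y - X @ b                           # residuals

XtX_inv = np.linalg.inv(X.T @ X)
k = X.shape[1]

# default standard errors: s_e^2 * (X'X)^{-1}
s2 = (e @ e) / (n - k)
se_default = np.sqrt(np.diag(s2 * XtX_inv))

# White (HC0) robust standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = (X * e[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(b, se_default, se_robust)
```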


5.1.3 Autocorrelation
Autocorrelation: Drop Assumption 4

• ui and uj not independent

• Very common for time-series data

Adjusting standard errors (NOTE: not examinable):

• Newey and West (1987)

• Estimate Cov[ui , uj ] by ei ej

• We only make this estimate for i, j close together: |i − j| < m

• For |i − j| ≥ m we maintain assumption that Cov[ui , uj ] = 0

Heteroskedasticity-and-autocorrelation-consistent HAC standard er-


rors

• If correct for autocorrelation, already correct for heteroskedasticity

• Stata: newey unemp infl, lag(m) where m is from |i − j| < m

• Choosing m (2 methods):

– m ≈ ∛T (cube root), where T is the sample size of the time-series data
– By observation (e.g. if corr[et , et−8 ] > 0.2 but corr[et , et−9 ] < 0.2,
choose m = 8)
– NOTE: m = 0 ⇒ heteroskedasticity-robust standard error

5.1.4 Clustering
Clustering: Special type of autocorrelation

• Panel data: correlation across time, but not individuals (i.e. ui,t correlated
with ui,t−1 , but not ui−1,t )

• Cross-section data: e.g. grades of students of same subject taught by


different lecturers

Cluster-robust standard errors

• Cluster-robust standard errors: df = G − 1 where G is number of clusters.


Why: each cluster only gives limited amount of information

• Stata: regress y x, vce(cluster variablename)


5.2 Prediction
5.2.1 Point forecasts
Forecast: given a data point x*, what is our best prediction of the corresponding y*?

Point forecast: use ŷ* = b1 + b2 x*

To build confidence intervals, we need the standard error of the forecast. How
to find depends on whether we want to forecast:
1. Conditional mean: β1 + β2 x∗
2. Actual value: β1 + β2 x∗ + u∗

5.2.2 Forecasting the conditional mean


Conditional mean: Conditional mean of y ∗ is E[y ∗ |x∗ ] = β1 + β2 x∗

Forecast variance becomes smaller when:


• Smaller σu 2 (less noise in population)
• Larger n (more data)
• Larger sx 2 (more varied data)
• Smaller (x∗ − x̄)2 (forecast point closer to sample mean)
Confidence intervals for conditional mean

Stata:
• Add x*, leave y* empty, regress y x
• Execute predict yhat; yhat is the point forecast ŷ*
• Find the standard error of yhat with predict [newvar], stdp, where stdp gives the standard
error of the prediction (of the conditional mean)
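A sketch of the conditional-mean forecast and its standard error for the bivariate model, using the textbook formula se = s_e·√(1/n + (x* − x̄)²/Σ(xᵢ − x̄)²), which captures all four factors listed above and should match what stdp reports in this simple case; the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9])   # made-up data
n = x.size

xbar = x.mean()
b2 = ((x - xbar) * (y - y.mean())).sum() / ((x - xbar) ** 2).sum()
b1 = y.mean() - b2 * xbar
e = y - (b1 + b2 * x)
se_reg = np.sqrt((e @ e) / (n - 2))          # standard error of the regression

x_star = 6.5
y_hat = b1 + b2 * x_star                     # point forecast

# standard error of the conditional-mean forecast (smaller near xbar, shrinks with n)
se_mean = se_reg * np.sqrt(1 / n + (x_star - xbar) ** 2 / ((x - xbar) ** 2).sum())

tcrit = stats.t.ppf(0.975, df=n - 2)
print(y_hat, (y_hat - tcrit * se_mean, y_hat + tcrit * se_mean))
```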

5.2.3 Forecasting the actual value


Actual value: Expected forecast for actual value of y∗ is E[y ∗ |x∗ ] + u∗


5.2.4 Comparison

Hourly wage example:


Week 6: Bivariate Regression and Transformations

6.1 Introduction
Data transformations: f (y) = β1 + β2 g(x) + u instead of y = β1 + β2 x + u

We must be careful with the interpretation of the results:

• Marginal effects: marginal effect of x on y

• Retransformation bias: transformations make estimations biased

6.1.1 Easiest transformation: Changing units


Changing units: a linear transformation (i.e. x → kx, y → cy)

Changing the units of x


Changing units of x: x ⇒ kx (e.g. k = 1000 for km → m)

y = β1 + β2 x + u ⇒ y = β1 + (β2/k)(kx) + u

Changing the units of y


Changing units of y: y ⇒ cy

y = β1 + β2 x + u ⇒ cy = cβ1 + cβ2 x + cu

6.2 Transformations of x
6.2.1 Dummy variables
Dummy variables introduction
Dummy variable: variable only taking the values 0 or 1 (denoted di )


Regression on a dummy variable

• b2 is an estimate for the difference in means between the 2 groups (1


and 0)

• Hence H0 : b2 = 0 is equivalent to H0 : µ1 = µ2

Example Do mean wages differ significantly between those who did and didn’t
graduate high school?

We transform x, which measures education in years:

Transformations of the regressor

Typically, we regress y on g(x), rather than transforming x:

6.2.2 Marginal effects


The marginal effect
Marginal effect: ∆ŷ/∆x

Model ŷ = b1 + b2 x: slope b2 = ∆ŷ/∆x; interpretation: expected change in y given a unit change in x

Model ŷ = b1 + b2 g(x): slope b2 = ∆ŷ/∆g(x); interpretation: e.g. expected wage difference between 2 groups

Recovering the marginal effect


Recover ∆ŷ/∆x from ∆ŷ/∆g(x):


Different marginal effects


Notice that the marginal effect is not constant (i.e. different for every x).
E.g. Marginal effect is 0 until you graduate high school
Hence we need a way to summarise the different marginal effects.

Summarising the marginal effect:


• Average Marginal Effect (AME): Average of marginal effects across all
individuals
i.e. (1/n) Σ_{i=1}^n ME(x_i)
• Marginal Effect at the Mean (MEM): M E(x̄)
• Marginal Effect at a Representative value (MER): M E(x∗ ), for some x∗
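A sketch of AME, MEM and MER for a linear-log model ŷ = b1 + b2 ln(x), where the marginal effect is ME(x) ≈ b2/x; the coefficients reuse the linear-log wage example below, and the education values are hypothetical:

```python
import numpy as np

# hypothetical linear-log wage equation: wage_hat = b1 + b2 * ln(educ)
b1, b2 = -7.46, 5.33                       # coefficients from the linear-log example in these notes
educ = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)   # hypothetical sample

me = b2 / educ                 # marginal effect of one more year, evaluated at each x_i

ame = me.mean()                # Average Marginal Effect
mem = b2 / educ.mean()         # Marginal Effect at the Mean
mer = b2 / 12.0                # Marginal Effect at a Representative value (x* = 12)
print(ame, mem, mer)
```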

6.2.3 Other transformations


Generic case
For ŷ = b1 + b2 g(x), b2 = ∆ŷ/∆g(x). The marginal effect is:

∆ŷ/∆x = b2 · ∆g(x)/∆x    (6.1)
g(x) can be anything, but it is usually:
1. Dummy indicator function
2. Natural logarithm

The linear-log model


Linear-log model: y_i = β1 + β2 ln(x_i) + u_i

• b2 = ∆y/∆ln(x)

Note that because ∆ln(x) ≈ ∆x/x, we have ∆y/∆ln(x) ≈ ∆y/(∆x/x)
i.e. b2 is approximately equal to the change in y given a relative change in x

• Rearranging the above, we also get ∆ŷ ≈ b2 · (∆x/x)

i.e. given a relative change in x, ŷ changes by b2 times that relative change

E.g. if x increases by 1% ⇒ ∆x/x = 0.01 ⇒ ŷ increases by b2 · 0.01


Marginal effects in the linear-log model


Marginal effect:

ME(g(x)) = b2 · ∆g(x)/∆x = b2 · ∆ln(x)/∆x ≈ b2 · (1/x)    (6.2)
NOTE: Effect of ∆x on y decreases as x increases

Example Graduating high school has bigger association with wages than PhD

Linear model: ŵage = −0.90 + 0.54 educ
(an extra year of education = an extra $0.54/hr)

Linear-log model: ŵage = −7.46 + 5.33 ln(educ)
(the 6th year of education raises the wage by 5.33/6 ≈ $0.89 per hour, the 18th year by 5.33/18 ≈ $0.30 per hour
- diminishing returns to education)

6.3 Transformations of y
6.3.1 Transforming y instead of x (Log-linear model)
We have:

• Population model: f(y) = β1 + β2 x + u

• Estimated regression: f̂(y) = b1 + b2 x

Clearly, b2 = ∆f̂(y)/∆x

Marginal effect:
∆ŷ/∆x = (∆ŷ/∆f̂(y)) · (∆f̂(y)/∆x) = b2 · ∆ŷ/∆f̂(y)    (6.3)

The log-linear model


Log-linear model: ln(ŷ) = b1 + b2 x


• Slope coefficient: b2 = ∆ln(ŷ)/∆x

• ∆ln(ŷ) ≈ ∆ŷ/ŷ ⇒ b2 ≈ (∆ŷ/ŷ)/∆x

• i.e. b2 is the predicted relative change in y given a unit change in x

• b2 is also called the semi-elasticity of y with respect to x

Marginal effects in the log-linear model


Marginal effect:

ME(ln(ŷ)) = b2 / (∆ln(ŷ)/∆ŷ) ≈ b2 / (1/ŷ) = b2 · ŷ    (6.4)

NOTE: Effect of ∆x on y increases as x increases

Example: l̂n(wage) = 0.58 + 0.08 educ

i.e. every additional year in school increases wages by approximately 8%

6.4 Transforming both x and y

6.4.1 Log-log model


Log-log model: ln(ŷ) = b1 + b2 ln(x)


• Slope coefficient: b2 = ∆ln(ŷ)/∆ln(x) ≈ (∆ŷ/ŷ)/(∆x/x)

• The slope coefficient now measures the relative change in y for every relative change
in x

• b2 is the elasticity of y with respect to x

6.4.2 Marginal effects in the log-log model


Marginal effect:
ME = ∆ŷ/∆x ≈ b2 · (ŷ/x)    (6.5)
NOTE: Effect of ∆x on ŷ is again non-constant

6.5 Retransformation bias


Retransformation bias: Obtaining ŷ back from ln(ŷ) is not as simple as
exp(ln(ŷ)), which is a biased estimate of E[y|x].

6.5.1 Avoiding retransformation bias


Given usual assumptions and normality of errors, the unbiased estimator
of E[y|x] is:
ŷ = exp(σ_u²/2) · exp(ln(ŷ))    (6.6)
But we don’t know σu 2 , so we replace it with se 2

WARNING: This formula DOES NOT WORK if assumptions fail (especially


heteroskedasticity) or errors are not normally distributed
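A simulation sketch of equation (6.6): with normal, homoskedastic errors, exp(l̂n(y)) underestimates E[y|x] while exp(s_e²/2)·exp(l̂n(y)) is roughly unbiased (everything below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.uniform(0, 5, n)
sigma_u = 0.8
u = rng.normal(0, sigma_u, n)
y = np.exp(1.0 + 0.5 * x + u)              # true model: ln(y) = 1 + 0.5x + u

# fit the log-linear model
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
e = np.log(y) - X @ b
s2e = (e @ e) / (n - 2)

x_star = 3.0
lnyhat = b[0] + b[1] * x_star
naive = np.exp(lnyhat)                          # biased estimate of E[y|x*]
corrected = np.exp(s2e / 2) * np.exp(lnyhat)    # eq. (6.6) with s_e^2 replacing sigma_u^2
true_mean = np.exp(1.0 + 0.5 * x_star + sigma_u**2 / 2)   # E[y|x*] under lognormality
print(naive, corrected, true_mean)
```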

6.6 Choosing a model


Mathematically: No log transformation should be used if a variable has negative
values
Economically:


Statistically: Look at data on scatter plot


• Long right tail? (Right-skewed?)
Statistically: use the coefficient of determination, R2
• higher R2 means model fits data better

• R² is valid for comparing models with the same y but different x

• R² is invalid for comparing models with different y
– Because R² measures how well the model fits the left-hand-side variable

6.7 Summary

Why log transformations:


1. Economic data is likely to be right-skewed

2. We are usually interested in elasticity


Week 8: Multivariate Data

NOTE: There was no additional content in Week 7 due to the mid-semester


test.

7.1 Problem with bivariate regression


7.1.1 Example: Death rate in car accidents, speed
Bivariate regression model:
Death rate in car accidents = β1 + β2 speed + ui (7.1)
Problem: other factors that might affect death rates are absorbed to ui

Example: Safety is another factor as


Cov(Speed, Saf ety) > 0 (7.2)
Better technology = better speed and better safety

Then:
u = β3 · Safety + other random errors, with β3 < 0    (7.3)
Implication: b2 will be
• Biased
• Inconsistent
– b2 will actually converge to:
β2 + β3 · Cov(Speed, Safety) / Var(Speed)    (7.4)
– If β2 > 0, b2 can even be negative if Cov(Speed, Safety) is strong enough,
since β3 < 0

7.1.2 Omitted Variable Bias


Omitted Variable Bias: The omission of a variable Z (e.g. safety), which is
correlated with both the dependent variable (e.g. death rate) and regressor
(e.g. speed), will make OLS estimators biased and inconsistent.
• Can occur in multivariate regression, most common in bivariate


7.2 Multivariate analysis: The general plan


Steps in multivariate analysis:

• Data description

– Plots
– Graphs

• Model: Multivariate regression

– Model and parameters of interest


– Estimation
– Assumptions and properties
– Inference
– Interpretation
– Prediction
– Evaluation and Comparison
– Important Special Cases
– Misspecification of the model

7.3 Data Description


7.3.1 Plots
If we have 3 variables, we can make a three-way scatter plot:


7.3.2 Graphs
If we have more than 3 variables, we need other methods:

• 3D scatter plot with colour

• Animation for time-series data

• Scatter plot table

Example: Scatter plot table - plots each pairing of variables

7.4 Multivariate Regression


7.4.1 Model and parameters of interest
We have:

• 1 dependent variable of interest (Y )

• Several independent variables (X)

Notation for k random variables:

• Y : dependent variable, outcome, LHS variable

• X2 , X3 , ..., Xk : covariates, explanatory variables, independent variables,


RHS variables, regressors

– NOTE: There is no X1 because β1 is the intercept; it does not multiply any regressor


So our model is now:

Y = β1 + β2 X2 + β3 X3 + ... + βk Xk + u (7.5)

Our parameters of interest are β1 , β2 , ..., βk

7.4.2 Estimation
Our line which fits the data best is:

y = b1 + b2 x2 + b3 x3 + ... + bk xk (7.6)

For each individual i, the realisations of X2 , X3 , ..., Xk are:

x2i , x3i , ..., xki (7.7)

The prediction of y given an individual i is:

ŷi = b1 + b2 x2i + ... + bk xki (7.8)

The OLS estimator for β1 , β2 , ..., βk is the set of values for b1 , b2 , ..., bk that
solves:
min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n (y_i − ŷ_i)²    (7.9)

or equivalently

min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n (y_i − (b_1 + b_2 x_{2i} + ... + b_k x_{ki}))²    (7.10)

or equivalently

min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n e_i²    (7.11)
To minimise, we need to find k different coefficients: b1 , b2 , ..., bk

We solve a system of k linear equations:


Σ_{i=1}^n e_i = 0
Σ_{i=1}^n x_{ji} e_i = 0  for j = 2, ..., k

Implications:
• The residuals sum to 0
• Each regressor is orthogonal to the residual

Condition on the data for this system to have a unique solution: we need
adequate variations in the data on all the x values (more on this in the next
lecture)
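A sketch solving this system numerically: with a design matrix X whose first column is ones, the equations above are the normal equations X'Xb = X'y, and the fitted residuals satisfy the two orthogonality properties listed (all data below are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 0.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])       # intercept plus k-1 regressors
b = np.linalg.solve(X.T @ X, X.T @ y)           # normal equations X'X b = X'y
e = y - X @ b                                   # residuals

print(b)
print(e.sum())                    # residuals sum to 0 (first equation of the system)
print(X[:, 1] @ e, X[:, 2] @ e)   # each regressor is orthogonal to the residuals
```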


7.4.3 Interpretation
Interpretation for b2 : the partial effect on the predicted value of y when X2
changes by one unit, holding X3 , ..., Xk fixed.

Equivalently, if we have 2 individuals having the same X3 , X4 , ..., Xk , we ex-


pect their difference in predicted Y to be b2 when their X2 differs by 1 unit.

i.e. The effect of x2 ceteris paribus

NOTE: Generally, the value of bj is different from the slope of a bivariate regression
of y on xj

Changing more than one variable simultaneously


To find the effect of a change in multiple variables, just use the regression line.

Example:

The aggregate effect of an individual staying at a firm for an extra year (i.e.
both experience and tenure increase by 1) is:

∆l̂n(Earnings) = 0.029·∆Experience + 0.011·∆Tenure
              = 0.029 + 0.011
              = 0.040
              ≈ 4% increase in earnings

7.4.4 Fitted values and residuals


Fitted or predicted value:
yˆi = b1 + b2 x2i + b3 x3i + ... + bk xki (7.12)
Residual:
ei = yi − yˆi (7.13)


Properties of fitted values and residuals:


1. Sample average of residuals = 0
2. Sample covariance between independent variables and residuals = 0 ⇒
Sample covariance between fitted values and residuals = 0
3. The point (x¯2 , x¯3 , ..., x¯k , ȳ) is always on the regression line

7.4.5 Goodness of fit


Note: All these measures are inappropriate for regressions without an intercept
(more about this when we talk about dummy variables)

Methods of measuring goodness of fit:


• Standard error of the regression
• R2
• Adjusted R2
Standard error of the regression: change in degrees of freedom
s_e = √[ (1/(n−k)) Σ_{i=1}^n (y_i − ŷ_i)² ]    (7.14)

R2 : no change from bivariate


TSS = Σ_{i=1}^n (y_i − ȳ)²
ESS = Σ_{i=1}^n (ŷ_i − ȳ)²
RSS = Σ_{i=1}^n (y_i − ŷ_i)²
TSS = ESS + RSS
R² = ESS/TSS = 1 − RSS/TSS,    R² ∈ [0, 1]
Problem: R2 always increases when we add more regressors to the model ⇒ we
have to use adjusted R2

Adjusted R² (R̄²):
R̄² = 1 − ((n − 1)/(n − k)) · (RSS/TSS),    R̄² ∈ (−∞, 1]

R¯2 is the balance between degrees of freedom and prediction accuracy.


This refers to the fact that more variables means:
1. Higher R2 , higher prediction accuracy


2. Fewer df , higher se , lower inference accuracy/reliability


We also have the following properties (This will make sense when we cover F
tests):
• Adding a regressor increases R̄² ⇐⇒ its |t| > 1
• Adding a group of regressors increases R̄² ⇐⇒ their joint F > 1
Example:

7.4.6 STATA Output


Sample STATA output:


Week 9: Inference for Multiple Regression Models

8.1 Assumptions
8.1.1 Data assumptions
Conditions needed on data to have a unique solution b1 , b2 , ..., bk :
1. Strictly more than k observations (so n − k > 0)
2. Adequate variation on the regressors - no perfect collinearity
Adequate variation: no regressor in the model can be expressed as an exact
linear combination of the other regressors

Example 1: Regress earnings on age, gender Assume the true relation


to be:

Suppose our dataset only includes males, so di = 1 for all i. i.e. there is
no variation in di in our sample.

This means we can write:

Earnings = 10 + 2 ∗ Age + D
= 10 + 2 ∗ Age + 1 as D = 1
= 11 + 2 ∗ Age
= 11 ∗ D + 2 ∗ Age as D = 1
= (11 − α) ∗ D + α ∗ D + 2 ∗ Age
= (11 − α) ∗ D + α + 2 ∗ Age as D = 1

So, now we have:

Earnings = (11 − α) ∗ D + α + 2 ∗ Age (8.1)


The best fitting line to this model will have:

• Intercept coefficient = α

• Slope coefficient on age = 2

• Slope coefficient on D = (11 − α)

for any α ∈ R. i.e. there are infinite solutions

Example 2: Regress earnings on age, school, experience Model:

Earnings = β1 + β2 Age + β3 School + β4 Experience + u (8.2)

True relation:

Earnings = 10 + Age + School + 2 · Experience + u (8.3)

Now assume that everyone in our dataset enters school at age 6 and works
as soon as they leave school. Then, we get the linear dependence relationship
(which we call perfectly collinear regressors):

Experience = Age − School − 6 (8.4)

Plugging this into our model, we get

Earnings = 10 + Age + School + 2 · (Age − School − 6) + u = −2 + 3 · Age − School + u

Which means the variable Experience is irrelevant, thus giving infinitely many
solutions.

Multicollinearity
Perfectly collinear regressors: regressors have a linear relationship

Multicollinearity: one or more of the regressors are very close to being an


exact linear combination of the other regressors. What this means is the regres-
sors likely measure the same things, incorporating the same information.

Implications of multicollinearity:

• Good

– We do get unique solutions


– STATA runs fine

• Bad: The estimated coefficients are very

– Imprecise: Large standard errors = inference less reliable


– Unstable: adding/deleting observations drastically changes estimates


8.2 Properties of the estimators


Properties of the residuals:

• The residuals sum to 0

• Each regressor is orthogonal (uncorrelated) to the residuals


• Σ_{i=1}^n ŷ_i = Σ_{i=1}^n y_i

• Each yˆi is orthogonal (uncorrelated) to the residuals

Example: Regress house price on bedrooms, size, bathrooms, lot size,


age, months old Model:

Price = β1 + β2·Bedrooms + β3·Size + β4·Bathrooms + β5·LotSize + β6·Age + β7·MonthsOld + u

We can see the aforementioned properties in STATA output:


8.3 Inference Part 1


8.3.1 Assumptions
Data assumptions
Assumptions: Ensures OLS coefficients are computable
1. Sample size (n) greater than number of regressors (k): n > k
2. Regressors are not perfectly collinear with each other

Population assumptions
Population assumptions: similar to bivariate assumptions
1. Linear model:

Yi = β1 + β2 X2i + ... + βk Xki + ui for all i    (8.5)

2. Error has mean zero, unrelated to regressors: (Endogeneity)

E[ui |X2i ...Xki ] = 0 (8.6)

3. Homoskedasticity: (Heteroskedasticity)

V ar[ui |X2i ...Xki ] = σu 2 (8.7)

4. Independent errors: (Autocorrelation)

ui , uj independent for all i ≠ j    (8.8)

Interpretation
• Assumptions 1 & 2: ensures estimators are unbiased and consistent
• Assumptions 3 & 4: determines precision and distribution of estimators


8.3.2 Properties of OLS estimators


Under Assumptions 1-4, we get the following properties:

1. OLS estimates are unbiased for population parameters:

E[bj ] = βj (8.9)

2. Variance of the OLS estimator bj is

Var[bj] = σ_{bj}² = σ_u² / Σ_i x̃_{ji}²    (8.10)

where x̃_{ji} is the residual from regressing x_{ji} on an intercept and all regressors other than itself,
i.e. xj = β1 + β2 x2 + ... + βj−1 xj−1 + βj+1 xj+1 + ... + βk xk

Example: j = 3. Regress x3i on x2i , x4i , ..., xki (with intercept) using
the same data set. The residuals of this regression are x̃3i (a numerical sketch of this appears after the list below).

Comments on variance:

• V ar[bj ] smaller when V ar[ui |X2i ...Xki ] = σu 2 smaller

• V ar[bj ] smaller the less xj is explained by other regressors (i.e. less


multicollinearity)

• As n → ∞, V ar[bj ] → 0 (i.e. the estimators are consistent)

3. We do not know σbj 2 as we don’t know σu 2 . So, we estimate σu with se :

se(bj) = s_e / √(Σ_i x̃_{ji}²)    (8.11)

The t-statistic is obviously

t = (bj − βj) / se(bj)    (8.12)

with the distribution being Tn−k as we have df = n−k. The approximation


is exact if:

• n→∞

• Errors are normally distributed


4. If:

• n→∞
• Errors are normally distributed

then
bj ∼ N (βj , σbj 2 ) (8.13)

5. OLS estimators are BLUE:

• Best Linear Unbiased Estimator


• Best = minimum variance

If errors are normally distributed, the estimators are BUE
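As promised above, a sketch of property 2: the residualised regressor x̃_j from regressing x_j on the other regressors reproduces both the multivariate slope b_j and its standard error s_e/√(Σ x̃_{ji}²) from equations (8.10)-(8.11) (data simulated; this partialling-out idea is sometimes called the Frisch-Waugh result):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)              # correlated regressors
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
k = X.shape[1]
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s_e = np.sqrt((e @ e) / (n - k))

# usual standard errors from s_e^2 (X'X)^{-1}
se_full = np.sqrt(np.diag(s_e**2 * np.linalg.inv(X.T @ X)))

# residualise x3 on an intercept and x2 (the "x tilde" of equation (8.10))
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.solve(Z.T @ Z, Z.T @ x3)
x3_tilde = x3 - Z @ g

b3_fwl = (x3_tilde @ y) / (x3_tilde @ x3_tilde)   # same as b[2]
se_b3 = s_e / np.sqrt((x3_tilde ** 2).sum())      # equation (8.11)
print(b[2], b3_fwl, se_full[2], se_b3)
```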

8.3.3 Hypothesis testing: single parameter


t statistic:
bj − β j
T = (8.14)
se(bj )
Note that se(bj ) → 0 as n → ∞
Degrees of freedom:
df = n − k (8.15)

Hypotheses interpretation: βj is the relationship between Xj and Y holding the
values of the other regressors fixed

• H0 : βj = 0 — Xj has no partial effect on the expected value of Y after


controlling for all other explanatory variables

• Ha : βj ≠ 0 — Xj has a partial effect on the expected value of Y after


controlling for all other explanatory variables


NOTE: We assume that all other regressors are non-zero in each hypothesis.
Interpretation of the test: ”Given our other regressors in our population model,
do we still need this regressor?”

e.g. X1 and X2 likely measure similar things (high correlation, multicollinearity)
⇒ both t-statistics likely low. But that doesn't mean we can drop both! Given
X1 , we don't need X2 , and vice versa. We need to check the joint significance of
the regressors.

So what we must do is check the correlation table for multicollinearity before


concluding that a regressor can be omitted (it might share the same information
as another regressor).

T-test:

1. Setting up hypotheses:

H0 : β j = 0
Ha : βj ≠ 0

2. Pick a significance level α

3. Calculate the t-statistic:


bj − β j
t= (8.16)
se(bj )

4. Evaluating the t-statistic

• p-value: probability of observing a t-statistic at least as extreme as


the one calculated

p − value = P r[|Tn−k | ≥ |t|] (8.17)

• Critical value:
c = tn−k,α/2 (8.18)

5. Conclusion

• Reject: Xj is statistically significant at the α% level


• Conclude using the economic context of the question

6. STATA Commands

• Critical value tn−k,α/2


display invttail (<n-k>, <a/2>)
• p-value P r[|Tn−k | ≥ a]
display ttail (<n-k>, <a>)

Single-tailed t-test:


1. Setting up hypotheses:

H0 : β j ≥ 0
Ha : β j < 0

or vice versa (H0 is what we want to reject, Ha is our claim)


2. Pick a significance level α
3. Calculate t-statistic

4. Evaluate the t-statistic


• p-value: probability of observing t-statistic at least as large/small as
the t-statistic we calculated

p = P r[Tn−k ≥ t] or P r[Tn−k ≤ t] (8.19)

The direction of ≤ / ≥ matches the direction of the inequality in Ha


5. Conclusion

Confidence intervals:
• x% CI:
          x% CI = bj ± tn−k,α/2 · se(bj ),   where α = 1 − x           (8.20)

• Interpretation

– We are x% confident that the CI covers the true population parameter


βj
– If random samples were obtained over and over again, with the CI
computed each time, then x% of the CIs will contain the population
parameter βj

• If df > 120, estimate using t∞
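A minimal STATA sketch of computing the 95% CI by hand (hypothetical regressor x2); regress already reports this interval, so the lines below only reproduce it.

   regress y x2 x3
   display _b[x2] - invttail(e(df_r), 0.025)*_se[x2]    // lower bound of the 95% CI
   display _b[x2] + invttail(e(df_r), 0.025)*_se[x2]    // upper bound of the 95% CI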

Week 10: Inference for Multiple Regression Models

9.1 Hypothesis Testing: One Linear Combination of Parameters
Consider a hypothesis test involving two parameters (linear restriction):

9.1.1 Example: Standard earnings (wages)


Model:
ln(earning) = β1 + β2 S + β3 exper + u (9.1)

Question: Does an extra year of formal education (S) have the same effect
on the natural logarithm of earnings (ln(earning)) as an extra year of general
workforce experience (exper)?
H0 : β2 = β3
Ha : β2 ≠ β3

Which is equivalent to:

H0 : β2 − β3 = 0
Ha : β2 − β3 ≠ 0

t-statistic:

                  t = (b2 − b3 ) / se(b2 − b3 )                        (9.2)

9.1.2 Calculating se(b2 − b3 )


Note that:

se(b2 − b3 ) = √(Var(b2 − b3 ))
Var(b2 − b3 ) = Var(b2 ) + Var(b3 ) − 2 Cov(b2 , b3 )


We will rewrite the model such that the STATA output will directly provide
se(b2 − b3 ). Define:
θ = β2 − β3 (9.3)

Then our model becomes:

ln(earning) = β1 + β2 S + β3 exper + u
= β1 + (θ + β3 )S + β3 exper + u
= β1 + θS + β3 (S + exper) + u
= β1 + θS + β3 X4 + u Define X4 = S + exper

STATA now reports θ̂ and se(θ̂) directly as the coefficient and standard error on S in this rewritten model.

Our test is now:

H0 : θ = 0
Ha : θ ≠ 0

Our t-statistic:

                  t = (θ̂ − 0) / se(θ̂)                                 (9.4)
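A hedged STATA sketch of both routes, assuming the variables are named lnearn, S and exper (lnearn is a guessed name for the log-earnings variable).

   * route 1: reparameterise so the coefficient on S is theta = beta2 - beta3
   gen X4 = S + exper
   regress lnearn S X4          // coefficient and se on S are theta-hat and se(theta-hat)

   * route 2: test the restriction on the original model directly
   regress lnearn S exper
   test S = exper               // F-test of H0: beta2 = beta3 (equals the squared t-test)
   lincom S - exper             // estimate, se and CI for beta2 - beta3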

9.2 Hypothesis testing: More than One Linear Restriction - The F-Test
We now consider whether a group of variables has an effect on the dependent
variable.


We are now doing joint hypothesis tests: more than one restriction on the
parameters.

9.2.1 Example: Parents’ education and child’s birth weight


Variables:

• bwght: birth weight

• cigs: average no. of cigs mother smoked per day during pregnancy

• parity: birth order

• faminc: family income

• motheduc: years of education (mother)

• fatheduc: years of education (father)

Our model:

bwght = β1 + β2 cigs + β3 parity + β4 faminc + β5 motheduc + β6 fatheduc + u    (9.5)

Our question: should motheduc and fatheduc be excluded from the model
after other variables have been controlled for (2 exclusion restrictions)?

H0 : β5 = 0, β6 = 0
Ha : H0 is false

NOTE: Ha holds if either β5 ≠ 0 or β6 ≠ 0. It is not appropriate to test this joint


hypothesis by considering 2 separate t-tests. Consider the case where we do this
at a 5% significance level:

TEST 1
H0 : β5 = 0
Ha : β5 ≠ 0
REJECT IF |b5 /se(b5 )| ≥ tn−k,0.025

TEST 2
H0 : β6 = 0
Ha : β6 ≠ 0
REJECT IF |b6 /se(b6 )| ≥ tn−k,0.025

We reject the joint hypothesis if either test rejects, so the probability of re-
jecting (when H0 is true) is:

       Pr[|t5 | ≥ tn−k,0.025 or |t6 | ≥ tn−k,0.025 ]                    (9.6)


This probability depends on Corr(t5 , t6 ), which we do not know. So, we can-


not know the probability of Type I error.

Consider the possible probabilities:

IF INDEPENDENT
1 − Pr[|t5 | < tn−k,0.025 and |t6 | < tn−k,0.025 ]
= 1 − Pr[|t5 | < tn−k,0.025 ] · Pr[|t6 | < tn−k,0.025 ]
= 1 − 95% · 95%
= 9.75%

IF PERFECTLY CORRELATED
Pr = 5%

The F-Test takes the correlation structure into account.

9.2.2 The F-test


F-test:
• F-test is for two-sided alternatives only
• Only if the errors are homoskedastic can the F-statistic be computed from
  the RSS/R² of the models estimated with, and then without, the restrictions
  imposed

Restricted/Unrestricted models
Unrestricted model (UR): Complete/original model

bwght = β1 + β2 cigs + β3 parity + β4 f aminc + β5 motheduc + β6 f atheduc + u


(9.7)

Restricted model (R): Imposing restrictions in null hypothesis

bwght = β1 + β2 cigs + β3 parity + β4 f aminc + u (9.8)

NOTE:
• RSS of Model R ≥ RSS of Model UR
• R2 of Model UR ≥ R2 of Model R

The F-statistic
F-statistic:

       F = [(RSSR − RSSUR )/q] / [RSSUR /(n − k)]                       (9.9)

where:


• q is the number of restrictions in H0

As TSSR = TSSUR , this is equivalent to:

       F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]                    (9.10)

TIP: To keep the numerator positive, always subtract the smaller value from the
larger one (RSSR ≥ RSSUR and R²UR ≥ R²R )

When errors are heteroskedastic, these formulas are no longer valid

The F-statistic is approximately F (q, n − k) distributed. The distribution is
exact when the errors are normally distributed, and the approximation improves
as n → ∞.

F-distribution can be denoted F (v1 , v2 ) or Fv1 ,v2 where v1 , v2 are the first (q)
and second (n − k) degrees of freedom


In our example:

• R²UR = 0.0387, R²R = 0.0364

• So F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)] = 1.42

Critical value of the F-test


Reject if F > critical value. The critical value comes from the (1 − α) percentile
of the F (q, n − k) distribution.

Finding the critical value on STATA:

display invFtail (<q>,<n-k>,<a>)

F-distribution depends on 2 parameters ⇒ we have a table for each α. Use


F (q, ∞) for n − k > 120.

In our example:

• q = 2, n − k = 1191 − 6 = 1185, α = 5%

• Critical value:

display invFtail(2, 1185, 0.05)


3.0033184

• F < cv, so we do not reject the null hypothesis

• Thus, motheduc and fatheduc do not have a statistically significant joint
  effect on bwght after the other variables have been controlled for
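A hedged STATA sketch of this joint test using the built-in test command, assuming the variable names from the example.

   regress bwght cigs parity faminc motheduc fatheduc
   test motheduc fatheduc                  // F-test of H0: beta5 = beta6 = 0 (reports F and p-value)
   display invFtail(2, e(df_r), 0.05)      // 5% critical value of F(q, n-k)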


9.2.3 F-test General Procedure


Model:
Y = β1 + β2 X2 + ... + βk Xk + u (9.11)

Hypotheses:

H0 : the q restrictions on the β’s hold jointly
Ha : H0 false

Models:

Y = β1 + β2 X2 + ... + βk Xk + u (UR)
Model with q restrictions (R)

F-statistic:

   F = [(RSSR − RSSUR )/q] / [RSSUR /(n − k)]
     = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]   if TSSUR = TSSR      (9.12)

Critical value:

(1 − α) quantile of the F (q, n − k) distribution (9.13)

Conclude:

• Reject: if F > Critical value

• Do not reject: otherwise
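A minimal STATA sketch of computing the F-statistic by hand from the two RSS values, assuming hypothetical regressors x2–x5 and q = 2 exclusion restrictions; e(rss) and e(df_r) are stored after regress.

   regress y x2 x3 x4 x5               // unrestricted model
   scalar rss_ur = e(rss)
   scalar df_ur  = e(df_r)             // n - k
   regress y x2 x3                     // restricted model (x4, x5 excluded)
   scalar rss_r = e(rss)
   display ((rss_r - rss_ur)/2) / (rss_ur/df_ur)    // F-statistic
   display invFtail(2, df_ur, 0.05)                 // 5% critical value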

9.2.4 Joint significance tests


Joint significance tests: Testing whether q explanatory variables have co-
efficients equal to 0

H0 : βk−q+1 = 0, ..., βk = 0
Ha : H0 is false

Interpretation:

• H0 rejected: xk−q+1 , ..., xk are jointly statistically significant

• H0 not rejected: xk−q+1 , ..., xk are jointly statistically insignificant


(justifies dropping them from the model)

NOTE: It is possible for a group of variables to be individually insignificant
but jointly significant


9.2.5 Special case: Overall significance


Overall significance test: tests whether all of the k − 1 explanatory variables are
jointly significant

H0 : β2 = 0, .., βk = 0
Ha : H0 false

The F-statistic and p-value of this test are reported in the standard STATA regression output.

Models:

Y = β1 + β2 X2 + ... + βk Xk + u (UR)
Y = β1 + u (R)

F-statistic: R²R = 0 and q = k − 1 for overall significance tests, so

   F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]
     = [R²UR /(k − 1)] / [(1 − R²UR )/(n − k)]                          (9.14)

9.2.6 Special case: Testing one restriction


An F-test is equivalent to the t-test when only one restriction is tested, e.g.:

H0 : βj = 0
Ha : βj ≠ 0

In this case:
• F -statistic = (t-statistic)²
• Critical value of the F -test = (critical value of the t-test)²
• Pr[|Tn−k | > |t|] = Pr[F1,n−k > t² ] = Pr[F1,n−k > f ], i.e. the p-values are the same

Week 11: Data Transformation

10.1 Data transformations


We require:
      The dependent variable (as entered in the regression) is linear in the regressors
Note that this is not equivalent to requiring
      Y is linear in the original explanatory variables
So what we can have is
      A transformation of Y is linear in transformations of the explanatory
      variables
Multivariate transformations:
• Natural log
• Polynomials of regressors
• Dummy variables
• Interaction terms

10.2 Natural log


Take logs when:
• The rate of change of Y is linear in the level of Xj , holding all other
regressors constant
– Regress ln(Y ) on Xj and other regressors
• The level of Y is linear in the rate of change of Xj , holding all other
regressors constant
– Regress Y on ln(Xj ) and other regressors
• The rate of change of Y is linear in the rate of change of Xj , holding
all other regressors constant
– Regress ln(Y ) on ln(Xj ) and other regressors


10.2.1 Example: Cobb-Douglas Production Function


Cobb-Douglas production function:

Y = AK α Lβ (10.1)

Suppose we want estimates of A, α, β. We make the following transformation


to a linear model:
ln(Y ) = ln(A) + αln(K) + βln(L) (10.2)
We regress ln(Y ) on ln(K) and ln(L). Our fitted regression is:

      ln(Y )ˆ = ln(A)ˆ + α̂ ln(K) + β̂ ln(L)                            (10.3)

To get an unbiased predicted value for Y , we must account for retransforma-


tion bias and use
      Ŷ = exp(ln(Y )ˆ) · exp(se² /2)                                   (10.4)
NOTE: Only valid if homoskedastic or normal errors
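A minimal STATA sketch, assuming variables Y, K, L exist in the dataset; e(rmse) is the regression's se, used here for the retransformation correction.

   gen lnY = ln(Y)
   gen lnK = ln(K)
   gen lnL = ln(L)
   regress lnY lnK lnL                           // alpha-hat, beta-hat are the slopes; exp(_b[_cons]) estimates A
   predict lnY_hat, xb                           // fitted values of ln(Y)
   gen Y_hat = exp(lnY_hat)*exp((e(rmse)^2)/2)   // predicted Y with the retransformation correction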

10.3 Polynomial models


Polynomial models: Fit non-linear relations

10.3.1 Quadratic model


Simple polynomial model: Quadratic model

Y = β1 + β2 X + β3 X 2 + u (10.5)

Here, we regress Y on X and X 2 .


NOTE: Y is not linear in X, Y is linear in X and X 2 .

Normally, when we regress Y on X, we are assuming E[Y |X] changes in the


same
• Direction
• Amount
as X increases by 1 unit.

The quadratic model assumes

E[Y |X] = β1 + β2 X + β3 X 2 (10.6)

This is appropriate when:


• Y increases with X, then Y decreases with X
• Y decreases with X, then Y increases with X
• Y changes slowly when close to a value of X, but Y changes fast when
far from that value of X


NOTE: Log models account for the case where

Y changes fast when close to a value of X, but Y changes slowly


when far from that value of X

Interpretation
In the quadratic model
Y = β1 + β2 X + β3 X 2 (10.7)
we do not interpret the individual coefficients as partial effects. This is
because X and X² are necessarily dependent (one is a function of the other). Instead, we look at the:

Expected change in Y when X changes by 1 unit and hence both X
and X² change at the same time

NOTE:

Y changes by a different amount depending on the initial value


of X!

In fact, this is true for all polynomial models.

Examples We have the quadratic model

Y = β1 + β2 X + β3 X 2 (10.8)

X changes from 1 to 2: The predicted change in Y is

∆Y = (β1 + β2 2 + β3 22 ) − (β1 + β2 1 + β3 12 )
= β2 + 3β3


X changes from 10 to 11: The predicted change in Y is

∆Y = (β1 + β2 11 + β3 112 ) − (β1 + β2 10 + β3 102 )


= β2 + 21β3

Marginal Effect
Marginal effect: Predicted change ∆Y when X changes by a very small ∆X
      ME = ∆Y / ∆X                                                     (10.9)
This also depends on the value of X. We can interpret ME as the slope of
the E[Y |X] curve at X (derivative).

In the quadratic model:


M E = β2 + 2β3 X (10.10)
which is just the derivative of the quadratic model.

NOTE: In a quadratic model, AME (average marginal effect) = MEM (marginal effect at the mean), since ME is linear in X

Example - Regress earnings on education



Estimated model:
earningsˆ = 29252.89 − 3830.641 · education + 439.5283 · education²    (10.11)

Education increases from 0 to 1: Predicted change in earnings

∆earnings = (−3830.641 · 1 + 439.5283 · 12 ) − (−3830.641 · 0 + 439.5283 · 02 )


= −3830.641 · 1 + 439.5283 · 12
= −3391.11

Education increases from 10 to 11: Predicted change in earnings

∆earnings = (−3830.641 · 11 + 439.5283 · 112 ) − (−3830.641 · 10 + 439.5283 · 102 )


= −3830.641 · 1 + 439.5283 · 21
= 5399.49
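A minimal STATA sketch of this example, assuming the variables earnings and education; the last lines reproduce the predicted changes and the marginal effect.

   gen educ2 = education^2
   regress earnings education educ2
   display _b[education] + _b[educ2]*(1^2 - 0^2)      // predicted change: education 0 -> 1
   display _b[education] + _b[educ2]*(11^2 - 10^2)    // predicted change: education 10 -> 11
   display _b[education] + 2*_b[educ2]*10             // marginal effect at education = 10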

10.3.2 Cubic Model


Cubic model:
Y = β1 + β2 X + β3 X 2 + β4 X 3 + u (10.12)
The cubic model assumes that

E[Y |X] = β1 + β2 X + β3 X 2 + β4 X 3 (10.13)

Quadratic useful for capturing the relations where:


• Y changes slowly around a particular value of X, but changes fast (in
DIFFERENT directions) when far from this value of X
• ME changes sign ONCE
Cubic useful for capturing the relations where:
• Y changes slowly around a particular value of X, but changes fast (in the
  SAME direction) when far from this value of X
• ME changes sign TWICE


10.3.3 General case


We can include higher powers in our polynomial:

Y = β1 + β2 X + β3 X² + ... + βp+1 X^p + u                             (10.14)

Think about whether assuming the following makes sense

E[Y |X] = β1 + β2 X + β3 X² + ... + βp+1 X^p                           (10.15)

NOTE:
• If X p is included, it makes more sense to include all X m for m ≤ p
• Including many powers can be considered a nonparametric way of esti-
mating E[Y |X]

• The F-test can help us decide how many powers to use (generally overes-
timates)
We can include other regressors in our polynomial model:

E[Y |X, X2 , ..., Xk ] = β1 + β2 X + β3 X² + ... + βp+1 X^p + βp+2 X2 + ... + βp+k Xk    (10.16)

Polynomial models are still multiple regression models. They just use
transformed data (similar to linear restriction transformations).

Week 12: Data Transformations Part II

11.1 Dummy variables


Dummy variable: incorporates binary data into regression models - include
dummy variables as explanatory variables

Example Consider the following simple model of wage determination

wage = β1 + δ1 f emale + β2 educ + u (11.1)

δ1 is used as the coefficient for the f emale variable to denote that it is a dummy
variable:
      femalei = 1 if individual i is female, 0 if individual i is male           (11.2)

11.1.1 Interpretation
We have:

δ1 = E[wage|f emale = 1, educ] − E[wage|f emale = 0, educ] (11.3)

We can interpret δ1 as the difference in mean wages between females and


males, at the same level of education.

So our model is:

      wage = β1 + δ1 female + β2 educ,  which gives                              (11.4)
             β1 + δ1 + β2 educ   for females
             β1 + β2 educ        for males

δ1 represents an intercept shift between the regression lines for males and
females


11.1.2 Dummy variables and perfect collinearity


Note that we cannot include a dummy variable for both male and females due
to perfect collinearity.

Consider the model

wage = β1 + δ1 f emale + δ2 male + β2 educ + u (11.5)

where
      malei = 1 if individual i is male, 0 if individual i is female             (11.6)
Then, we have
f emalei + malei = 1 (11.7)
for all individuals i ⇒ perfectly collinear.

We can only use either of these two models:

wage = β1 + δ1 f emale + β2 educ + u


wage = β˜1 + δ2 male + β2 educ + u

These two models are equivalent:

wage = β˜1 + δ2 male + β2 educ + u


= β˜1 + δ2 (1 − f emale) + β2 educ + u
= (β˜1 + δ2 ) − δ2 f emale + β2 educ + u

So we have:
• β2 remains the same
• β˜1 ≠ β1 ; β1 = β˜1 + δ2 (i.e. the intercept in Model 1 is the intercept in Model
  2 plus δ2 )
• δ1 = −δ2


• The predicted regression lines for males and females remain the same

Notice the relationship between the STATA outputs of the two models (output omitted).

Just think about it case by case:

• What happens when male? f emale = 0, male = 1

• What happens when female? f emale = 1, male = 0

NOTE:

• The sum of the dummy variable coefficient and the intercept is always the
intercept of the other model

• Sum of sample means of dummy variables equals 1. This is because sample


means of dummy variables are sample proportions!


11.1.3 Categorical Variables with Multiple Groups


We can also have dummy variables that represent categorical data taking on
more than 2 values.

Example Relationship between the quantity of ice cream sold at a grocery store
in a week and the season of the year

We cannot have the model:


ice = β1 + β2 season + u (11.8)
where
      season = 1 (spring), 2 (summer), 3 (autumn), 4 (winter)                     (11.9)

as this assumes that the difference in ice cream sales between any two seasons
is fixed at k · β2 for some integer k ∈ {1, 2, 3}.

E.g. Ice creams sold in summer - Ice creams sold in spring = 1 · β2 .

Instead, we must have a different dummy variable for each season. So,
now let’s consider the model:
ice = β1 + δ1 dspring + δ2 dsummer + δ3 dautumn + δ4 dwinter + u (11.10)
The problem here is that we have perfect collinearity:
dspring + dsummer + dautumn + dwinter = 1 (11.11)
So, we must either:
• Exclude 1 of the 4 variables
• Keep all 4 variables, exclude the intercept
These options are all equivalent. I.e. The conclusions will be the same.

Example Let’s exclude the spring dummy variable

We call spring the baseline group or the reference group (the coefficient
of spring has essentially become β1 ). We have the model:
ice = β1 + δ2 dsummer + δ3 dautumn + δ4 dwinter + u                               (11.12)
where
β1 = E[ice|spring]
δ2 = E[ice|summer] − E[ice|spring]
δ3 = E[ice|autumn] − E[ice|spring]
δ4 = E[ice|winter] − E[ice|spring]


This model does not have a problem with collinearity, given that we have ob-
servations from each season in our dataset.

Note: We can calculate the difference in means across groups by finding the
difference in their coefficients

E.g.

δ4 − δ2 = (E[ice|winter] − E[ice|spring]) − (E[ice|summer] − E[ice|spring])
        = E[ice|winter] − E[ice|summer]

These coefficients are shifts in the intercepts of the regression lines for different
groups.

11.1.4 Hypothesis testing


To test whether season has an effect on ice, we just need to test for joint
significance:
H0 : δ 2 = δ 3 = δ 4 = 0 (11.13)
with the F-test.

11.1.5 Multiple Categorical Variables


Same procedure. Make sure to exclude one dummy variable for each
categorical variable. We have, in general, g − 1 dummy variables if we have
g groups.
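A minimal STATA sketch, assuming a variable season coded 1–4 with spring as the first level; factor-variable notation i.season creates the dummies and omits a baseline group automatically.

   regress ice i.season          // Stata omits the first level (spring) as the baseline group
   testparm i.season             // joint F-test of H0: all season coefficients equal 0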

11.2 Interaction terms


Interaction term: an additional variable created to allow the effect of one regres-
sor to depend on the value of another regressor

We generally want to explain Y using X2 , X3 , ..., Xn :

• Without interaction terms, E[Y |X2 , ..., Xn ] changes with Xi in the


same way when other regressors are fixed at different levels

• With interaction terms, E[Y |X2 , ..., Xn ] changes with Xi differently


when other regressors are fixed at different levels

Example Add new dummy variable nonwhite to the earning model to account
for race

wage = β1 + β2 educ + δf emale f emale + δnonwhite nonwhite + u (11.14)

The assumption in this model is that the difference in wages between males and
females is the same for whites and non-whites.


If this doesn’t seem like a reasonable assumption, we can add an interaction
term (a new variable):

fe_nw = female · nonwhite                                                         (11.15)

where
      fe_nw = 1 if female and nonwhite, 0 otherwise                               (11.16)
So, now we have the model

wages = β1 + β2 educ + δfemale female + δnonwhite nonwhite + δfe_nw fe_nw + u


(11.17)
There is no perfect collinearity as long as we have observations for each of
the combinations:

• (f, nw)

• (f, w)

• (m, nw)

• (m, w)

So, we get a separate predicted mean wage for each of the four groups (STATA output omitted).

NOTE:

• Include an interaction term between two dummy variables ⇒ different


intercepts in regression lines for each group

• Include an interaction term between a dummy variable and a continuous
  variable ⇒ different slopes in the regression lines for each group

Example Interaction term between gender and education

Let’s add the interaction term

educ_female = educ · female                                                       (11.18)

to get the model

wages = β1 + β2 educ + β3 educ_female + δfemale female + u                        (11.19)


Then we get the predicted wages:

M ale :β1 + β2 educ


F emale :β1 + β2 educ + β3 educ · 1 + δf emale · 1
= (β1 + δf emale ) + (β2 + β3 )educ

which have different slopes and intercepts.

Here, to test whether there is discrimination against females, it makes sense


to test the joint null hypothesis:

H0 : β3 = δf emale = 0 (11.20)

Different combinations of dummy and interaction terms shift either the intercept
or the slope (or both) of the group regression lines.
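A minimal STATA sketch of this model and the joint discrimination test, assuming variables wage, educ and female.

   gen educ_female = educ * female
   regress wage educ educ_female female
   test female educ_female           // joint F-test of H0: beta3 = delta_female = 0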


11.3 The alternative


Another way to get different predicted values for different groups is to run re-
gressions separately for each group.

E.g. We can estimate the model

wages = β1 + β2 educ + u (11.21)

using observations on females only (Model 1) and then observations on males


only (Model 2). We get the following 2 models:

The results will be equivalent to a model including a female dummy and a fe-
male/educ interaction term.

To test for discrimination against females, we need to check if the regression


coefficients are the same in Model 1 and Model 2.


11.4 Chow Test


Chow Test: Tests whether population regression functions are the same across
different groups

The Chow test is an F-test where we:

• Restricted model uses the full sample.

– This model imposes that the population regression coefficients are


the same across groups.

• Unrestricted model is the combination of regressions of each group.

– RSSU R = RSSU R,group1 + RSSU R,group2 + ... + RSSU R,groupn

• Specify q, n, k

• Compute the F-statistic

• Find critical value from the F-distribution

• Conclude

Example M/F wages and education


Our restricted model is

wages = β1 + β2 educ + u (11.22)

using the full sample.

We consider the RSS in this model: RSSR .

Our unrestricted model is

wages = β1 + β2 educ + u (11.23)

using data on females only to get RSSU R,f emale and then males only to get
RSSU R,male . Then, we use

RSSU R = RSSU R,f emale + RSSU R,male (11.24)

The alternative unrestricted model is

wages = β1 + β2 educ + β3 educ f emale + δf emale f emale + u (11.25)

using the full sample. We use the RSS of this model as RSSU R .

Now we consider our degrees of freedom:

• q = 2 as the null imposes that the 2 regression coefficients are the same
  across groups

• k = 4 as we have 2 coefficients × 2 groups


• n is the sample size in the full sample

Now, let’s carry out our Chow test:

RSSUR = RSSUR,female + RSSUR,male = 30288 + 55929 = 86217
RSSR = 92688
F-statistic = [(92688 − 86217)/2] / [86217/536] ≈ 20.12
Critical value: F (2, 536) at the 5% level ≈ 3.01 ⇒ reject H0 ; the regression
coefficients differ between males and females
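A minimal STATA sketch of this calculation, assuming variables wage, educ and a female dummy; e(rss) and e(N) store each regression's RSS and sample size.

   regress wage educ                         // restricted model: full sample
   scalar rss_r = e(rss)
   scalar n = e(N)
   regress wage educ if female == 1          // unrestricted model: females only
   scalar rss_f = e(rss)
   regress wage educ if female == 0          // unrestricted model: males only
   scalar rss_m = e(rss)
   scalar rss_ur = rss_f + rss_m
   display ((rss_r - rss_ur)/2) / (rss_ur/(n - 4))    // Chow F-statistic, q = 2, k = 4
   display invFtail(2, n - 4, 0.05)                   // 5% critical value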

NOTE: The Chow test can only be used where all coefficients are the
same across groups. Otherwise, we have to use the F-test.

i.e. We can only use the Chow test for the null that all of the coefficients
β1 , ..., βk are the same across groups.

Example We want to test whether β2 , β3 are the same for males/females while
allowing β1 to differ in the model

wages = β1 + β2 educ + β3 tenure + u (11.26)

Then, we will have to use the F-test where:

• Unrestricted model includes the dummy variable and the full set of
interaction terms (between dummy and other variables)

– UR model has intercept, education, tenure, female, female_educ, female_tenure

• Restricted model imposes all null hypothesis restrictions

– R model has only intercept, education, tenure, female

11.5 Binary Dependent Variables


Sometimes, our Y is binary:

• Actual observed values can only be 0 or 1

• Predicted values can be in (−∞, ∞)


We interpret the predicted values of Y as probabilities:

E[Y |X2 , ..., Xk ] = β1 + β2 X2 + ... + βk Xk
                    = 1 · Pr[Y = 1|X2 , ..., Xk ] + 0 · Pr[Y = 0|X2 , ..., Xk ]
                    = Pr[Y = 1|X2 , ..., Xk ]

Thus, regressing a binary dependent variable Y on X2 , ..., Xk is called the linear
probability model.
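A minimal STATA sketch of a linear probability model, assuming a hypothetical 0/1 dependent variable employed; robust standard errors are commonly used here because the LPM's error variance necessarily depends on the regressors.

   regress employed educ exper, robust    // coefficients are changes in Pr[Y = 1 | X]
   predict phat, xb                       // predicted probabilities (may fall outside [0, 1])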

Week 13: Model Misspecification

12.1 Data checks


12.1.1 Errors and missing values
Generally, do not exclude observations with missing values.

• The fact that values are missing can itself provide information

• Excluding these can lead to selection bias

12.1.2 Outliers
Outlier: an observation whose value is unusual in the context of the rest of the
data

Outliers:

• Usually heavily influence the regression outcome

• Should not be simply dropped from the data

Importance of outlier screening:

• Avoid erroneous data

• Large impact on OLS estimations

Take care if your results change significantly when a few key observations (’in-
fluential observations’) are dropped. Results with these observations dropped
may be included as a robustness check.

Checking for outliers


Univariate case Box/Whisker plot


Bivariate case Check the plot of Y on X

Multivariate case Common practice is to plot residuals against each regres-


sor. If there are too many regressors, plot the residuals against the fitted values
Ŷi

2 ways to quantify the effect of outliers:

• DFITS: change in fitted values

• DFBETA: change in coefficient estimates
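A minimal STATA sketch of both diagnostics after a regression (hypothetical variable names); predict with the dfits option and the dfbeta command are available after regress.

   regress y x2 x3
   predict dfits_i, dfits     // DFITS: influence of each observation on its own fitted value
   dfbeta                     // DFBETA: creates one _dfbeta_* variable per regressor
   rvfplot                    // residuals vs fitted values, a quick visual check for outliers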

12.2 Model Checks


Ways in which assumptions can be wrong:

• Wrong model chosen

– Omitted Variable Bias


– Irrelevant variables included
– Non-linear relationship

• Errors correlated with:

– Regressors (Endogeneity)


– Each other (Autocorrelation)

• Errors have non-constant variance (Heteroskedasticity)

12.2.1 Wrong model


Omitted Variable Bias
Suppose we use the model

Y = β 1 + β2 X2 + u (12.1)

despite the true model being

Y = β 1 + β2 X2 + β 3 X3 + u (12.2)

If all assumptions hold, regressing Y on X2 and X3 should give E[b2 ] = β2 .

But, we regress Y on X2 only. So X3 is the omitted variable. We get:

Ŷ = b˘1 + b˘2 X2 (12.3)

In general, E[b˘2 ] ≠ β2 . If X2 is correlated with X3 , b˘2 picks up the effects of


both regressors. So, if we have

X3 = γ 1 + γ 2 X 2

then we get

E[b˘2 ] = β2 + β3 γ2

Don’t worry about how this is derived.

NOTE: The error term of the short (misspecified) model also absorbs the omitted
variable:

u˘ = u + β3 X3

So, we see that E[b˘2 ] is equal to β2 plus:

• γ2 : The relationship between the omitted variable and the regressor X2

• β3 : The relationship between the omitted variable and the dependent


variable Y

So, as long as both γ2 ≠ 0 and β3 ≠ 0, we have E[b˘2 ] ≠ β2 . b˘2 is a biased estimator
for β2 , the true population coefficient for X2 . This is omitted variable bias.

This formula allows us to determine the sign or direction of the bias. We


have
E[b˘2 ] = β2 + γ2 β3 (12.4)
Upward bias: E[b˘2 ] > β2 when

• γ 2 , β3 > 0


• γ 2 , β3 < 0
Downward bias: E[b˘2 ] < β2 when
• γ2 > 0, β3 < 0
• γ2 < 0, β3 > 0
No bias: E[b˘2 ] = β2 when
• γ2 = 0
• β3 = 0
Dealing with OVB:
• Simplest: Include the omitted variable in our regression
• Advanced techniques: instrumental variables, natural experiments, etc.
• Last resort: determine the sign of the bias

Example Determining the sign of the bias - Returns to Education

It is difficult to estimate returns to education due to personal ability being


an omitted variable:
• Higher ability people are more likely to go to school γ2 > 0
• Higher ability people are more likely to have higher-earning careers β3 > 0
We have γ2 , β3 > 0, so the bias is upwards.

Irrelevant variables
Implications of including irrelevant variables, or too many variables:
• OLS estimators remain unbiased and consistent
• OLS estimators are less precise
– Multicollinearity may be a problem
– Multicollinearity → se increases → t-test less accurate when n is
small

12.2.2 Endogeneity
In the regression model

Y = β1 + β2 X2 + ... + βk Xk + u (12.5)

if u is correlated with a regressor Xj , this regressor is endogenous.

Common sources of endogeneity:


• OVB (as ui = u + βk Xk when Xk omitted)


• Simultaneous change in variables (e.g. Y increase ⇒ X increase and vice


versa)

– Simultaneity bias: Y affects X, X affects Y, so both effects are insep-


arably combined into the coefficient of X. This means the coefficient
estimate is biased.

• Sample selection (e.g. non-representative sampling on the dependent vari-


able)

Implications of endogeneity: OLS estimators become

• Biased

• Inconsistent

12.2.3 Functional form misspecification


Functional form misspecification: Model does not properly take into ac-
count relationship between dependent and independent variables. Occurs when
a key variable, which is a function of other variables, has been omitted.
E.g. non-linear relationship

Implications:

• se increases

• Estimates are biased/inconsistent

Example 1
True population model:

Y = β1 + β2 educ + β3 age + β4 age2 + u

Our model:

Y = β1 + β2 educ + β3 age + u

Implication: the estimates of β1 , β2 , β3 all become biased

Example 2
True population model:

ln(income) = β1 + β2 educ + β3 age + β4 f emale + β5 f emale × age + u

Our model:

ln(income) = β1 + β2 educ + β3 age + β4 f emale + u

Implication: the estimates of β1 , β2 , β3 , β4 all become biased


Plots for detecting functional misspecification


To examine the overall fit of a model, look at the scatter plot of:

• Actual values of Y against X vs the fitted line

• Residuals vs predicted values

To examine the functional form of a single regressor, look at the scatter plot
of:

• Residuals vs the regressor

Test for detecting functional misspecification


Regression Specification Error Test (RESET Test): a test for neglected
non-linear terms

• A special type of F-test

Limits of the RESET Test:

• n increases ⇒ estimator precision increases ⇒ chance of rejecting H0
  increases (in fact, H0 is rejected at any significance level as n → ∞)

• Cannot tell extent of model misspecification

• Doesn’t indicate how to correct misspecification

Consider the model

Y = β1 + β2 X2 + ... + βk Xk + u (R)

If this model is correct, then:

• Population assumptions are satisfied

• Including any particular choices of higher order terms or interac-


tion terms should not have any significant effect on the model


The RESET test adds polynomials of the OLS fitted values Ŷ , usually Ŷ ² and Ŷ ³.

• Ŷ ² and Ŷ ³ are 2 particular combinations of higher order terms and inter-
  action terms of the regressors

• If the model is correct, adding Ŷ ² and Ŷ ³ should not have any significant
  effect

So, we get the model

      Y = β1 + β2 X2 + ... + βk Xk + γ1 Ŷ ² + γ2 Ŷ ³ + u               (UR)

RESET Test:

1. Estimate restricted model (R) to get the predicted values Ŷ

2. Estimate unrestricted model (UR) with Yˆ2 and Yˆ3 as additional regressors

3. Perform F-test

• H0 : γ1 = γ2 = 0 and Ha : H0 false
• In this case, F-statistic is F (2, n − (k + 2)) distributed
• If we reject, (R) is likely to be misspecified
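A minimal STATA sketch of the RESET test by hand (hypothetical regressor names); Stata's built-in estat ovtest runs a version of this automatically after regress, using powers of the fitted values.

   regress y x2 x3
   predict yhat, xb
   gen yhat2 = yhat^2
   gen yhat3 = yhat^3
   regress y x2 x3 yhat2 yhat3
   test yhat2 yhat3             // F-test of H0: gamma1 = gamma2 = 0

   * or, equivalently in spirit:
   regress y x2 x3
   estat ovtest                 // Ramsey RESET test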

Example Comparing 2 Housing Price Models

We wish to compare the following 2 models:

price = β1 + β2 lotsize + β3 sqrft + β4 bdrms + u                       (1)

ln(price) = β1 + β2 ln(lotsize) + β3 ln(sqrft) + β4 bdrms + u           (2)

Model 1:

• UR: price = β1 + β2 lotsize + β3 sqrft + β4 bdrms + γ1 (fitted price)² +
  γ2 (fitted price)³ + u

• n = 88, q = 2, k = 4, F2,88−4−2

• F = 4.67

• Reject ⇒ likely misspecified

Model 2:

• UR: ln(price) = β1 + β2 ln(lotsize) + β3 ln(sqrft) + β4 bdrms +
  γ1 (fitted ln(price))² + γ2 (fitted ln(price))³ + u

• n = 88, q = 2, k = 4, F2,88−4−2

• F = 2.56

• Do not reject ⇒ likely correctly specified

So, we prefer the log-log specification.


12.2.4 Heteroskedasticity
Heteroskedasticity: Errors for different observations have different variances.

Check scatter plots of:


• Y against X
• Residuals against X
(Scatter plots illustrating homoskedasticity vs heteroskedasticity omitted.)

Implications of heteroskedasticity:
• Heteroskedasticity does NOT:
– Cause bias or inconsistency
– Affect estimated coefficients
– Affect goodness of fit (R², R̄²)
• Heteroskedasticity does:
– Make the usual variance formulas/estimators invalid ⇒ standard errors of
  the estimators are invalid ⇒ inferences are invalid
Fix:
regress y x, robust
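A minimal STATA sketch of the check-then-fix workflow (hypothetical variable names); rvfplot and estat hettest are available after regress.

   regress y x
   rvfplot                    // residuals vs fitted values: a fanning-out pattern suggests heteroskedasticity
   estat hettest              // Breusch-Pagan test for heteroskedasticity
   regress y x, robust        // heteroskedasticity-robust standard errors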


12.2.5 Autocorrelation
Implications:

• Estimators remain unbiased and consistent


• se calculations and inferences become much less accurate
2 common occurrences:

• Time series data


– Fix: HAC standard errors
• Clusters: errors correlated for observations in the same cluster, uncorre-
lated for observations in different clusters

– Fix: cluster-robust standard errors
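A minimal STATA sketch of the two fixes, assuming a hypothetical time variable year for the time-series case and a hypothetical cluster identifier firmid for the clustered case.

   * time series: HAC (Newey-West) standard errors
   tsset year
   newey y x, lag(4)

   * clusters: cluster-robust standard errors
   regress y x, vce(cluster firmid)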
