
ECMT1020 Notes

Sean Gong

July 10, 2020


Week 1: Summarising Univariate Data

1.1 Types of data


1.1.1 Classifying data
Dataset: records the values of variables on several cases

• Variable: Characteristic of interest

• Case: Unit for which we measure a variable

Data classification criteria:

• Variable number: univariate, bivariate, multivariate

• Variable type: quantitative, categorical

• Case type: cross-section, time series (esp. used in economics), panel data

Different statistical techniques are used for each data type.

1.1.2 Types of variables


Numerical data

• Continuous: Can be arbitrarily precise

• Discrete: Cannot take arbitrarily precise values

Categorical data

• Not naturally numbers

NOTE: Numbers can be used to represent categorical data

Downloaded by Farrell Wihandoyo (wihandoyofarrell@gmail.com)


lOMoARcPSD|23815106

4 CHAPTER 1. WEEK 1: SUMMARISING UNIVARIATE DATA

1.1.3 Types of cases


• Cross-section data: Each case is a separate 'individual' (e.g. 2019 GDP
data for all OECD countries)

• Time series data: Same ’individual’ at different points in time (e.g. Aus-
tralian GDP growth 2000-2009)

• Panel data: Combination of cross-section and time series; A characteristic


recorded for different individuals at different points in time (e.g. GDP
data for all OECD countries in 20th century)

1.1.4 Notation
Example: Suppose we have univariate data on x (e.g. x is annual GDP growth)

• Cross section: n data points

– Each case is labelled xi for i = 1, 2, 3, ..., n


– Usually arbitrary (e.g. x1 = Albania, x2 = Afghanistan, etc.)

• Time series: t data points

– Each case is labelled xt


– (e.g. x1 is GDP growth for 1970, x2 for 1971)

• Panel data: n × t data points

– Data points labelled xnt

Multivariate cases

• Bivariate: effect of explanatory variable x on response variable y (e.g.
x = education level, y = annual income)

• Multivariate: effect of explanatory variables x1 , x2 , ... on y (e.g. x1 =
Chinese GDP growth, x2 = Australian inflation, y = Australian unemployment rate)

1.2 Summarising data


1.2.1 Statistics Recap
Statistics: using data to understand a parameter we cannot observe
Statistical model: using data to understand a population we cannot observe (e.g. every human being on Earth)
Statistics summary:

1. Assume dataset is a sample of the population

2. Calculate estimate of population parameter from sample

• Convention: Greek letters for population parameters, Latin letters


for sample estimates


3. Inference to say something about the estimated parameter (e.g. confi-


dence intervals, hypothesis tests, etc.)

Data analysis summary:

1. Summary: Summary statistics and graphs to illuminate essential features
of the data

2. Inference: What do these numbers tell about the parameters we are trying
to estimate?

3. Interpretation: What is the economic meaning?

1.2.2 Summary statistics


4 summary statistics - measures of:

1. Central tendency (e.g. average, median)

2. Dispersion (e.g. SD)

3. Asymmetry (e.g. skew)

4. Kurtosis (e.g. propensity for outliers / fat tails)

Central tendency
Sample mean
x̄ = (1/n) Σ_{i=1}^n x_i    (1.1)

Stata: mean [varname], or tabstat [varname], stat(mean)

Sample median Median: Middle observation

• Advantage: robust to outliers

• Disadvantage: insensitive to the magnitude of observations (uses only the middle value)

Stata: summarize [varname], detail, or tabstat [varname], stat(median)

Sample mode Mode: Most frequently occurring observation

• Not very useful for continuous data; can be made artificially discrete (by
grouping)

Sample midrange Midrange: Average of min and max observations

• Very sensitive to outliers; essentially useless

Dispersion
Quantiles Quartiles: special percentiles (the 25th, 50th and 75th percentiles), used for the IQR below


Sample variance
s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)²    (1.2)

NOTE: n - 1 is the degrees of freedom; not n because we are using sample mean
x̄ not population mean µ

• Degrees of freedom: number of values involved in calculations that have


the freedom to vary

• Total number of observations - number of independent constraints on ob-


servations

Standard deviation

s = √(s²)    (1.3)
68-95-99.7 rule: for roughly bell-shaped data, about 68%, 95% and 99.7% of observations lie within 1, 2 and 3 SDs of the mean.

Coefficient of variation
cv = s / x̄    (1.4)

• Advantage: unit-free ⇒ can be compared across different variables

• Disadvantage: nobody uses it

Range
range = max − min    (1.5)

Interquartile range

IQR = 3rd quartile − 1st quartile    (1.6)

Symmetry
Symmetry: Similarity when reflected about the median
Skew(x) = (1/n) Σ_{i=1}^n ((x_i − x̄)/s)³    (1.7)

Kurtosis
Kurtosis: fatness of the tails; how much frequency is distributed to the tails
Kurt(x) = (1/n) Σ_{i=1}^n ((x_i − x̄)/s)⁴    (1.8)

x normally distributed ⇒ Kurt(x) = 3

• Excess kurtosis = Kurt(x) − 3
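As a quick numerical illustration of equations (1.1), (1.2), (1.7) and (1.8), here is a minimal Python sketch (the course itself uses Stata; the data vector is made up purely for illustration):

```python
import numpy as np

x = np.array([2.0, 3.5, 4.0, 4.5, 5.0, 6.0, 9.5])  # made-up sample

n = x.size
xbar = x.mean()                          # equation (1.1)
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # equation (1.2): n-1 degrees of freedom
s = np.sqrt(s2)                          # equation (1.3)
cv = s / xbar                            # equation (1.4)
skew = (((x - xbar) / s) ** 3).mean()    # equation (1.7): uses 1/n, not 1/(n-1)
kurt = (((x - xbar) / s) ** 4).mean()    # equation (1.8); excess kurtosis = kurt - 3

print(xbar, s, cv, skew, kurt - 3)
```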


1.2.3 Graphs
Box plot

Box plot indicates:

• Minimum

• Maximum

• Quartiles (incl. median)

• Skewness

Note: Outliers lie above upper quartile + 1.5 × IQR (or below lower quartile − 1.5 × IQR)

Histogram

Histogram: Shows how often certain values occur

• Primarily useful for cross-section data

• Not as useful for time-series // no info on when values occurred

• y-axis: frequency or percentages

– Latter more useful with large datasets


Histograms for continuous variables:


• Each value has frequency of 1 ⇒ directly using a histogram is not useful ⇒
group observations into bins (usually about √n bins) ⇒ makes the data discrete
• Alternatives:
– Smoothed histogram
– Kernel density estimate

Line graphs
Line graph: plots observations against observation number
• Shows how values changing over time
• Only useful where x-axis is variable with natural ordering
Stata: tsline

Categorical data
Options:
• Frequency table
• Bar chart

• Pie chart


Week 2: Probability Recap, Inference on Means

2.1 Probability recap


2.1.1 Experiments and random variables
Experiment: An operation whose outcome cannot be predicted with certainty
Random variable: Variable whose value depends on the outcome of the experi-
ment

• E.g. Coin flipping: Random variable is no. of heads

• Notation:

– Random variable: X (Upper case)


– Specific value/outcome: x (lower case)
∗ One realisation of X is x
– Probability Mass Function (PMF) of X: Pr[X = x]
i.e. probability of X taking value x
– Cumulative Distribution Function: Pr[X ≤ x]

Case study: Flipping a coin 2 times, random variable = no. of heads


• 4 outcomes: TT, TH, HT, HH; each with probability 1/4

• Values of X: 0, 1, 1, 2 respectively
PMF:
– Pr[X = 0] = 1/4
– Pr[X = 1] = 1/4 + 1/4 = 1/2
– Pr[X = 2] = 1/4

CDF:

– Pr[X ≤ 0] = 1/4
– Pr[X ≤ 1] = 1/4 + 1/4 + 1/4 = 3/4
– Pr[X ≤ 2] = 3/4 + 1/4 = 1


2.1.2 Discrete vs. continuous


Random variables can be:

• Discrete: X takes on discrete values (e.g. coin flipping)

• Continuous: X takes on an uncountable no. of values

CDF: Generalises nicely (i.e. applicable to both discrete/continuous)


PMF: Does not generalise // for continuous X, Pr[X = x] = 0 for every single value (uncountably many values)
⇒ Probability Density Function (PDF): derivative of the CDF

• If normal distribution, PDF is bell curve


2.1.3 Expected value


Expected value: Weighted average of all x that X can take, weights determined
by probability
E[X] = Σ_{i=1}^n x_i · Pr[X = x_i]    (2.1)

E.g. Coin flipping example:

E[X] = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1    (2.2)
⇒ EV = 1 head flip
This is a population quantity ⇒ E[X] is the population mean of X, denoted µ

2.1.4 Population variance


Sample variance: average of (xi − x̄)2 (divide by n-1)
Population variance: average of (X − µ)2
σ² = Var[X] = E[(X − µ)²] = Σ_{i=1}^n (x_i − µ)² · Pr[X = x_i]    (2.3)

NOTE: Similarity between EV and variance


EV: weighted average of all values X can take.
Variance: weighted average of all squared differences between the random variable and its EV.
Coin flipping example:

σ² = (0 − 1)² · (1/4) + (1 − 1)² · (1/2) + (2 − 1)² · (1/4) = 1/2    (2.4)

Population SD = σ = √(1/2) = 1/√2    (2.5)

Population standard deviation of X: σ = √(σ²)
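A minimal Python sketch checking equations (2.1)-(2.5) for the two-flip coin example, using the values and probabilities stated above:

```python
import numpy as np

x = np.array([0, 1, 2])            # values of X (number of heads)
p = np.array([0.25, 0.5, 0.25])    # PMF from the two-flip example

ev = (x * p).sum()                 # E[X] = 1, equation (2.2)
var = ((x - ev) ** 2 * p).sum()    # sigma^2 = 1/2, equation (2.4)
sd = np.sqrt(var)                  # sigma = 1/sqrt(2), equation (2.5)
cdf = np.cumsum(p)                 # Pr[X <= x] = 0.25, 0.75, 1

print(ev, var, sd, cdf)
```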

NOTE: Skewness and kurtosis have population analogues (not in ECMT1020)

2.1.5 Linear transformations of random variables


Linear transformation: Finding Y = a + bX where we know the mean and
variance of X
E.g. Celsius → Fahrenheit

• Mean of Y = a + bµ

• Variance of Y = b2 σ 2

• SD of Y = √(b²σ²) = |b| · σ
NOTE: b ∈ R, σ = SD ≥ 0


2.1.6 Standardisation
(X − µ)/σ always has SD 1 and mean 0 // linear transformation

• Y = (X − µ)/σ, i.e. Y = a + bX

• a = −µ/σ, b = 1/σ

• Mean of Y = a + bµ = −µ/σ + µ/σ = 0

• SD of Y = |b| · σ = (1/σ) · σ = 1
Y is the standardised form of X

2.1.7 Linear combination


Linear combination: aX + bY where X, Y are random variables, a, b are non-
random constants
E[aX + bY] = a · E[X] + b · E[Y]    (2.6)
X and Y independent (IMPORTANT: make sure no correlation!)⇒
V ar[aX + bY ] = a2 V ar[X] + b2 V ar[Y ] (2.7)
We obtain the above result from the fact that
Var[aX + bY] = a²Var[X] + 2ab·Cov(X, Y) + b²Var[Y]    (2.8)

2.2 Sampling distributions


Sample statistics should give information about population statistics

x̄ ⇒ µ, s2 ⇒ σ 2 (2.9)
KEY: Sample statistics are themselves random variables (the numbers we compute are realisations of those random variables)

Sample statistics have a distribution

2.2.1 Observations are realisations of random variables


Identically distributed: Distribution of sample random variables same as
distribution of whole population
Independent: Two observations are independent if knowing one does not give
information about the other

2.2.2 The sample mean as a random variable


Sample mean: x̄ = (1/n)(x1 + x2 + ... + xn)

It makes sense to think of it as: X̄ = (1/n)(X1 + X2 + ... + Xn)

x̄ is a realisation of X̄

Study distribution of X̄ → quantify what x̄ tells us about µ


2.2.3 The distribution of X̄


Assumptions:

• X1 , X2 , ..., Xn are independent and identically distributed (iid)

• Each Xi is selected from a distribution with mean µ and variance σ², so the sum X1 + X2 + ... + Xn is (approximately) normally distributed

– Central Limit Theorem: Sum of random variables will be nor-


mally distributed, even if the variables themselves are not

We want to learn about µ from the sample, without knowing µ or σ²

Expected value of X̄
Given that:
• X̄ = (1/n)X1 + (1/n)X2 + ... + (1/n)Xn

• E[X1 ] = µ, E[X2 ] = µ, ..., E[Xn ] = µ

Then applying linear combinations:


E[X̄] = (1/n)·µ + (1/n)·µ + ... + (1/n)·µ = n · (1/n) · µ = µ    (2.10)
We say that the sample mean X̄ is an unbiased estimator of the population
mean µ

Variance of X̄

Var[X̄] = Var[(1/n)X1 + (1/n)X2 + ... + (1/n)Xn] = (1/n²)·σ² + (1/n²)·σ² + ... + (1/n²)·σ² = n · (1/n²) · σ² = σ²/n    (2.11)

SD of X̄ is σ/√n

• Variance reduces as sample size grows

2.2.4 Statistical properties of X̄


Sample mean X̄ properties:

• X̄ is unbiased estimator: E[X̄] = µ

• X̄ is consistent estimator: distribution more concentrated around µ as


sample size grows
For every constant c > 0:

lim_{n→∞} Pr[µ − c ≤ X̄_n ≤ µ + c] = 1    (2.12)

• X̄ has the minimum variance of all unbiased estimators of µ
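A small simulation sketch of these properties (illustrative only; the population here is an arbitrary non-normal distribution, chosen to show the CLT at work):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0        # exponential(1) population: mean 1, SD 1
reps = 20000

for n in (5, 50, 500):
    xbars = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    # E[Xbar] should be close to mu; SD of Xbar should be close to sigma/sqrt(n)
    print(n, xbars.mean(), xbars.std(ddof=1), sigma / np.sqrt(n))
```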


2.3 Inference on the mean


2.3.1 The z statistic
X̄ ∼ N(µ, σ²/n): X̄ has a normal distribution with mean µ and variance σ²/n

The z statistic: difference between sample mean and population mean as pro-
portion of SD; how many SDs away from the population mean?

z = (X̄ − µ) / (σ/√n)    (2.13)

NOTE: Z ∼ N (0, 1)

2.3.2 Standard error of the sample mean


s2 is realisation of random variable S 2 with expected value σ 2

We use s to estimate σ

SD of X̄ estimated as s/√n

s/√n is the standard error of the sample mean

2.3.3 The t statistic


t = (X̄ − µ) / (S/√n)    (2.14)

T does not have a normal distribution. Instead, the t statistic (approximately) follows:

• a t distribution with n−1 degrees of freedom
• which approaches the normal distribution as n increases

The t statistic is exactly t_{n−1} distributed when X ∼ N(µ, σ²); the approximation also improves as n → ∞.


2.3.4 Confidence intervals
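The material for this subsection is graphical in the original notes. As a minimal sketch, a 100(1−α)% confidence interval for µ has the form x̄ ± t_{n−1,α/2} · s/√n (the same structure as equation (3.2) later); the Python example below uses made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([38.1, 41.3, 40.2, 44.5, 39.0, 42.7, 40.9, 37.6])  # made-up data
n, xbar, s = x.size, x.mean(), x.std(ddof=1)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # critical value t_{n-1, alpha/2}
se = s / np.sqrt(n)                            # standard error of the sample mean
ci = (xbar - tcrit * se, xbar + tcrit * se)    # 95% confidence interval for mu
print(ci)
```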


2.3.5 Hypothesis testing


Steps:

• Null hypothesis H0 : claim to be rejected - generally µ = µ∗

Alternative hypothesis Ha : alternative to claim - generally µ ≠ µ∗

• Basic idea: Assuming H0 is true, how unlikely is our sample estimate?

• Significance level α: cut-off point between extremely unlikely and not


unlikely

The p value

p-value: probability of obtaining a sample estimate at least as extreme as ours, if H0 is true

• Reject H0 if p-value < α


Critical values
Critical region: Values of t such that H0 should be rejected

• Thus probability of being in critical region is α

• Critical region: |t| > tn−1,α/2

• Critical value: boundary value tn−1,α/2

2.3.6 Type I and Type II Errors


Type I Error: Reject true H0
Type II Error: Do not reject false H0

We can control for these errors by adjusting α

Trade-off: decreasing Pr[Type I Error] increases Pr[Type II Error], vice versa.


This occurs when we increase α. The inverse of this is also true.

NOTE: Because we can adjust α, the probability of Type I and II Er-


ror is completely under the control of the statistician.

α is the maximum probability of Type I error.

We also have, in Colin Cameron’s textbook, the following definitions:

• Test size: Pr[Type I error] = α


• Test power: 1 - Pr[Type II error]


Week 3: More hypothesis tests, Data transformations

3.1 Further hypothesis testing


3.1.1 2-sided vs. 1-sided tests
2-sided tests: do not care about direction of effect (larger/smaller)
• Usually used to answer question ”does X have any influence on Y?”
• Reject H0 if sample mean much larger OR smaller
1-sided tests: care about direction of effect
• Reject H0 only if smaller OR larger

3.1.2 1-sided hypotheses


2 options for 1-sided test:
1. H0 : µ ≤ µ∗ vs Ha : µ > µ∗
2. H0 : µ ≥ µ∗ vs Ha : µ < µ∗
2-sided example: claim = ”Annual average salary is $40000”
1. H0 : µ = 40000
2. Ha : µ ≠ 40000
IMPORTANT: Here, rejecting H0 means ”claim is wrong” (i.e. H0 is the
claim)

1-sided example: claim = ”Annual average salary is above $40000”


1. H0 : µ ≤ 40000
2. Ha : µ > 40000
IMPORTANT: Here, rejecting H0 means ”claim is right” (i.e. Ha is the claim)
COMMENT: Why is it that H0 serves different purposes for 1-sided
and 2-sided tests?


3.1.3 p-values and critical regions for 1-sided tests


Consider:

1. H0 : µ ≤ 40000

2. Ha : µ > 40000

2 approaches (same as 2-sided tests):

1. p-value approach

• More unlikely x̄ = more consistent with Ha

• Likelihood of x̄ measured by p-value from t statistic

• In this case, the p-value is Pr[Tn−1 ≥ t]; Pr[Tn−1 ≤ t] in the opposite case

• 2-sided: Pr[|Tn−1 | ≥ |t|]

2. Critical value approach

• 2-sided: Reject for very large AND small values of t

• 2-sided critical region: |t| > Tn−1,α/2

• 1-sided: Reject for very large OR small values of t

• 1-sided critical region: t > Tn−1,α OR t < −Tn−1,α

– The inequality sign is the same as in Ha


3.1.4 t-tests in Stata


Stata: ttest reports both 1-sided AND 2-sided

i.e. ttest earnings==40000 returns p-values for 3 tests:


1. H0 : µ = 40000 vs Ha : µ ≠ 40000
2. H0 : µ ≤ 40000 vs Ha : µ > 40000
3. H0 : µ ≥ 40000 vs Ha : µ < 40000
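A sketch of how the three p-values reported by ttest can be reproduced from summary statistics (the sample numbers below are hypothetical):

```python
import numpy as np
from scipy import stats

# hypothetical summary statistics for 'earnings'
n, xbar, s = 200, 41500.0, 12000.0
mu0 = 40000.0

t = (xbar - mu0) / (s / np.sqrt(n))
df = n - 1

p_two   = 2 * stats.t.sf(abs(t), df)   # H0: mu = 40000 vs Ha: mu != 40000
p_right = stats.t.sf(t, df)            # H0: mu <= 40000 vs Ha: mu > 40000
p_left  = stats.t.cdf(t, df)           # H0: mu >= 40000 vs Ha: mu < 40000
print(t, p_two, p_right, p_left)
```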

3.1.5 Hypotheses concerning other parameters


To test other parameters (e.g. diff. between 2 means, proportion) we can either:
1. Follow same procedure of t test for µ
2. Use Central Limit Theorem:
t = (estimator − parameter value under H0) / (SE of estimator)    (3.1)
which approx. follows a t distribution, where df = n − (number of estimated parameters, excl. the SE)

Confidence intervals and hypothesis tests


CLT means normal procedure for confidence intervals

100(1-α)% confidence interval for a parameter:


estimate ± tdf,α/2 × SE (3.2)

Hypothesis tests still the same: compute the t statistic, reject H0 if p-value < α


Difference in means
Example: Male average annual salary higher than female average annual salary?

We have:

• 2 independent random variables

1. Male ave. ann. salary: X1 ∼ N (µ1 , σ12 )


2. Female ave. ann. salary: X2 ∼ N (µ2 , σ22 )

• Hypotheses

– H0 : µ 1 − µ 2 ≤ 0
– Ha : µ 1 − µ 2 > 0

• Independent random samples

– x1,1 , x1,2 , x1,3 , ..., x1,n1


– x2,1 , x2,2 , x2,3 , ..., x2,n2

t statistic:

• µ1 − µ2 estimated by: x̄1 − x̄2


– se(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)

• n1 + n2 observations, 2 estimated means HENCE df = n1 + n2 − 2

Hence:
t = [(X̄1 − X̄2) − (µ1 − µ2)] / √(s1²/n1 + s2²/n2)    (3.3)
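A sketch of equation (3.3) with hypothetical group summary statistics (df approximated by n1 + n2 − 2, as in the notes):

```python
import numpy as np
from scipy import stats

# hypothetical male / female salary samples (summary statistics)
n1, xbar1, s1 = 120, 62000.0, 15000.0
n2, xbar2, s2 = 110, 58000.0, 14000.0

se_diff = np.sqrt(s1**2 / n1 + s2**2 / n2)
t = (xbar1 - xbar2) / se_diff          # H0: mu1 - mu2 <= 0 vs Ha: mu1 - mu2 > 0
df = n1 + n2 - 2
p_one_sided = stats.t.sf(t, df)        # reject H0 for large t
print(t, p_one_sided)
```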

Example:


Proportions
Example: Proportion of loans in default

A proportion is a mean: mean of a variable which can take on the value


of either:

• 1: if the thing we wish to measure occurs (defaulted loan)

• 0: if the thing we wish to measure doesn’t occur (no default)

It is a mean, so the usual t test should work

Let population proportion be p, sample proportion be p̂:

• H0 : p = p* vs Ha : p ≠ p*

• SE of p̂: √(p*(1 − p*)/n)

Variance is known under H0 ⇒ we can use the normal distribution instead


of t distribution

NOTE: hard to estimate proportions close to 1 or 0 ⇒ we need:

np, n(1 − p) > 10 (3.4)

3.2 Univariate data transformation


3.2.1 Logarithms recap

A logarithm is an exponent:

• E.g. 103 = 1000, log10 1000 = 3

Natural logarithm uses constant e as a convenient base


Exponential function ex often written as exp(x)


3.2.2 Linearising exponential growth


Linear approximation to the natural logarithm:

Usefulness of logarithms/exponentials

• Many economic variables are exponential

• Our models are linear

• Logarithms/exponentials allow transformations between the two

Example: a0 dollars in an account with an annual interest rate of 100r%

at = a0 (1 + r)t ⇒ ln(at ) = ln(a0 ) + tln(1 + r) (3.5)

The exponential function a_t has been transformed into ln(a_t), which is linear in t

Example: Australian GDP transformation

3.2.3 Proportionate changes


Logarithms are useful in approximating proportionate change:

• Change in variable: ∆x = xt+1 − xt


• Proportionate change: ∆x/x_t

• Logarithmic transformation: ∆ln(x) ≈ ∆x/x_t
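A quick numerical check that ∆ln(x) tracks the proportionate change ∆x/x_t well for small changes but not for large ones (the numbers are arbitrary):

```python
import numpy as np

x_t = 100.0
for x_next in (101.0, 105.0, 150.0):           # +1%, +5%, +50%
    prop_change = (x_next - x_t) / x_t
    log_change = np.log(x_next) - np.log(x_t)  # Delta ln(x)
    print(prop_change, log_change)
# 0.01 vs 0.00995..., 0.05 vs 0.0488..., 0.50 vs 0.405...
```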

3.2.4 Eliminating right skewness


Logarithms can reduce skewness; most economic variables are right-skewed:
• ln(x) highly steep for small x, flat for large x
• THUS inflates small x, reduces large x
• Brings closer to normal distribution

COMMENT: Will analyzing transformed data lead to biases; is it


still reliable?

3.2.5 Other useful transformations


• Adjusting by price level to get real values
– Given a base year B:
realGDP_t = nomGDP_t × (price_B / price_t)    (3.6)

• Adjusting by size of population to get per capita values


– Given the population:
perCapRealGDP_t = realGDP_t / population_t    (3.7)

3.2.6 Moving averages


Moving average (MA) smooths fluctuations and highlights long-term pat-
terns
1. simple MA: averages over previous recent observations
x̃_t = (1/5) · (x_{t−4} + x_{t−3} + x_{t−2} + x_{t−1} + x_t)    (3.8)


2. centred MA: averages over surrounding observations


x̃_t = (1/5) · (x_{t−2} + x_{t−1} + x_t + x_{t+1} + x_{t+2})    (3.9)

Simple MA more popular // can be calculated at time t (needs no future observations)

NOTE: 5 is not a fixed constant
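A sketch of the 5-period moving average window from equations (3.8)-(3.9), applied to an arbitrary series:

```python
import numpy as np

x = np.array([4.0, 6.0, 5.0, 7.0, 8.0, 6.0, 9.0, 10.0, 8.0, 11.0])

window = np.ones(5) / 5
ma = np.convolve(x, window, mode="valid")   # average of each block of 5 consecutive values

# eq. (3.8): the simple MA at time t averages x[t-4..t]   -> ma[j] is the simple MA for t = j + 4
# eq. (3.9): the centred MA at time t averages x[t-2..t+2] -> ma[j] is the centred MA for t = j + 2
print(ma)
```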

Deseasonalising data (12-month MA):

3.2.7 Growth rates


Growth rate is a proportionate change: (x_t − x_{t−1}) / x_{t−1}

2 reporting options for quarterly/monthly data:


1. Annualised growth rate: extrapolates current quarter growth to a whole year:
4 × (x_t − x_{t−1}) / x_{t−1}    (3.10)

2. Year-on-year growth rate: compares the current quarter to the same quarter last year:
(x_t − x_{t−4}) / x_{t−4}    (3.11)


Week 4: Bivariate data, least squares regression

4.1 Notation
Bivariate data: 2 variables

• We have (x1 , y1 ), (x2 , y2 ), ..., (xn , yn )

• We assume X influences Y

4.2 Cross Tables


Cross Table: summarises relationship between 2 discrete variables

• Cross tables not useful for continuous variables

It is common to include:

• Row percentages summing to 100% for each row

• Column percentages summing to 100% for each column

(Example tables: 1. Row percentages; 2. Column percentages)

4.3 Scatter Plots


Scatter Plot: summarises relationship between 2 continuous variables


4.4 Bivariate distributions


Univariate case: We need to know Pr[X = x]
Bivariate case: We need to know Pr[X = x, Y = y]
• NOTE: Not just Pr[X = x], Pr[Y = y]
Example: Gender and voting behaviour

• Vertical total of rows: Shows gender distribution irrespective of political


affiliation = Pr[X = x] = marginal distribution of X
• Horizontal total of columns: Shows voting distribution irrespective of gen-
der = Pr[Y = y] = marginal distribution of Y
Conditional probabilities: Probability of Democrat given female and vice
versa?
Pr[Democrat|female] = Pr[female, Democrat] / Pr[female] = 0.22/0.51 ≈ 0.431    (4.1)
Similarly:
Pr[female|Democrat] = Pr[female, Democrat] / Pr[Democrat] = 0.22/0.39 ≈ 0.564    (4.2)


4.5 Bayes’ Rule


Generalising the previous results:

Pr[X = x|Y = y] = Pr[X = x, Y = y] / Pr[Y = y]    (4.3)

Also:
Pr[Y = y|X = x] = Pr[X = x, Y = y] / Pr[X = x]    (4.4)
Combining, we get Bayes’ Rule:

Pr[Y = y|X = x] / Pr[Y = y] = Pr[X = x|Y = y] / Pr[X = x]    (4.5)
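A sketch of equations (4.1)-(4.5) using the joint and marginal probabilities quoted above (Pr[female, Democrat] = 0.22, Pr[female] = 0.51, Pr[Democrat] = 0.39):

```python
p_joint = 0.22       # Pr[female, Democrat]
p_female = 0.51      # marginal Pr[female]
p_democrat = 0.39    # marginal Pr[Democrat]

p_dem_given_fem = p_joint / p_female      # eq. (4.1), ~0.431
p_fem_given_dem = p_joint / p_democrat    # eq. (4.2), ~0.564

# Bayes' rule, eq. (4.5): both sides equal the same ratio
lhs = p_dem_given_fem / p_democrat
rhs = p_fem_given_dem / p_female
print(p_dem_given_fem, p_fem_given_dem, lhs, rhs)
```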

4.6 Independence
Dependence: Knowing the value of one variable changes the probability distribution of the other variable.
Independence: Knowing the value of one variable does not change the probability distribution of the other variable, i.e.:

P r[X = x|Y = y] = P r[X = x] (4.6)

And (from equations 4.3, 4.4):

P r[X = x, Y = y] = P r[X = x] · P r[Y = y] (4.7)

NOTE: It is useful to think of it as flipping a coin

4.7 Conditional means and variances


Conditional distributions are univariate distributions!

HENCE: Conditional means and conditional variances are computed in


same way as univariate means and variances


4.8 Covariance
Covariance: scaled measure of linear dependence
σXY = E[(X − µx )(Y − µy )] (4.8)
• σXY > 0: X and Y either both large or both small
• σXY < 0: One large, other small
• X and Y independent ⇒ σXY = 0. NOTICE THIS IS A ONE-WAY STATEMENT (σXY = 0 does not imply independence)
NOTE: Var[X] = σ_X² = σ_XX
NOTE: Covariance only measures linear dependence (σXY ≈ 0 is possible even when there is a non-linear relationship)
NOTE: Covariance is NOT scale-free (e.g. measuring X in metres instead of km multiplies σXY by 1000)

Sample covariance:
s_XY = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)    (4.9)

4.9 Correlation
Correlation: scale-free measure of linear dependence
ρ_XY = σ_XY / (σ_X · σ_Y)    (4.10)
Sample correlation:
r_XY = s_XY / (s_X · s_Y)    (4.11)
Correlation: the no. of SDs that y changes by when x changes by 1 SD


Due to formula of correlation, points in quadrants have specific signs:

• rXY > 0: More in positive quadrants than negative quadrants

• rXY < 0: More in negative quadrants than positive quadrants

4.10 Regression
Regression: Finding line of best fit for scatter plot

i.e. find values b1 , b2 such that

ŷi = b1 + b2 xi (4.12)

is as close to yi for all i = 1, 2, ..., n as possible

• y is dependent variable, x is independent variable

• ŷ are fitted values (y values when subbed into straight line formula)

• b1 is intercept, b2 is slope

Residual: Distance between line and value

ei = yi − ŷi (4.13)

We want to minimise the residual when constructing our regression line

4.11 Least Squares Regression


How to make ei small:


1. Minimise Σ_{i=1}^n e_i: Does not work as positive and negative residuals
cancel out

2. Minimise Σ_{i=1}^n |e_i|: Mathematically complicated and has poor statistical properties

3. Minimise Σ_{i=1}^n e_i²: THIS WORKS; Least squares regression

We are trying to solve:


min_{b1,b2} Σ_{i=1}^n e_i² = min_{b1,b2} Σ_{i=1}^n (y_i − ŷ_i)² = min_{b1,b2} Σ_{i=1}^n (y_i − b_1 − b_2 x_i)²    (4.14)

Using calculus, the following optimisations are solved:


b_2 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²,    b_1 = ȳ − b_2 x̄    (4.15)

Note that for b_2, if the numerator and denominator are each multiplied by 1/(n−1), the numerator is the sample covariance and the denominator is the sample variance. Hence, we get:

b_2 = s_xy / s_x² = (s_xy · s_y) / (s_x · s_x · s_y) = r_xy · (s_y / s_x)    (4.16)
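A sketch computing b_2 and b_1 directly from equations (4.15)-(4.16) and cross-checking against a library fit; the data are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])   # made-up bivariate data

xbar, ybar = x.mean(), y.mean()
b2 = ((x - xbar) * (y - ybar)).sum() / ((x - xbar) ** 2).sum()   # eq. (4.15)
b1 = ybar - b2 * xbar

# equivalent forms, eq. (4.16): b2 = s_xy / s_x^2 = r_xy * s_y / s_x
sxy = np.cov(x, y, ddof=1)[0, 1]
b2_alt = sxy / x.var(ddof=1)
r = np.corrcoef(x, y)[0, 1]
b2_alt2 = r * y.std(ddof=1) / x.std(ddof=1)

print(b1, b2, b2_alt, b2_alt2)
print(np.polyfit(x, y, 1))   # cross-check: returns [slope, intercept]
```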

4.12 The meaning of b1 and b2


b1 : The intercept

• The fitted value of y for x = 0

• NOTE: Sometimes doesn’t make economic sense

b2 : The slope

• Predicted change in y given a unit change in x


• b_2 = dŷ/dx

4.13 Point forecasts


Sub in x (independent variable) values to get predicted y (dependent variable)
values

4.14 Correlation vs. Causation


We usually assume X causes changes in Y , but this may not be true


4.15 Standard error of regression


We want to see how ’good’ the regression line is. We look at the variance of ei :
(1/(n−1)) Σ_{i=1}^n (e_i − ē)²    (4.17)

Obviously, ē = 0. Also, we have estimated 2 coefficients b_1, b_2, so we actually
divide by n − 2, not n − 1. Hence, the standard error of regression (s_e):
s_e² = (1/(n−2)) Σ_{i=1}^n e_i²    (4.18)

Also known as root mean squared error

4.16 Sums of squares


3 measures to understand relationship between X and Y:

1. Total sum of squares (TSS): total variation in Y


TSS = Σ_{i=1}^n (y_i − ȳ)²,    TSS/df_TSS = Var[y]    (4.19)

2. Explained sum of squares (ESS): Variation explained by model


ESS = Σ_{i=1}^n (ŷ_i − ȳ)²,    ESS/df_ESS = Var[ŷ]    (4.20)

3. Residual sum of squares (RSS): Variation not explained by model


RSS = Σ_{i=1}^n (y_i − ŷ_i)²,    RSS/df_RSS = Var[e] = s_e²    (4.21)

NOTE: T SS = ESS + RSS

4.17 Coefficient of determination


Coefficient of determination (R2 ): Proportion of total variation explained
by the model
R² = ESS/TSS,    R² ∈ [0, 1]    (4.22)
Higher R2 means model fits better and vice versa. It turns out that:

R² = r_xy² = r_yŷ²    (4.23)


4.18 Statistical assumptions


4.18.1 Assumption 1: Linear model
Assume that, for all i:
yi = β 1 + β 2 x i + u i (4.24)
where β1 , β2 are fixed population parameters, ui is an error term

That is, Y depends linearly on X in the population


NOTE: We don’t say anything about X, only Y’s dependence on X

4.18.2 Assumption 2: Mean independence


Assume that, for all i:
E[ui |xi ] = 0 (4.25)
That is, for any particular xi , ui = 0 on average
NOTE: This may be false if there is a pattern in ui values

4.18.3 Implication of Assumptions 1 and 2


Implication: The expected value of yi given xi is β1 + β2 xi , i.e.:

E[yi |xi ] = β1 + β2 xi (4.26)

The population regression line is just E[y|x]

4.18.4 Assumption 3: Homoskedasticity


Assume that, for all i:
V ar[ui |xi ] = σu 2 (4.27)


That is, the variance of the error term as x changes is constant. Otherwise,
heteroskedastic, e.g.:

4.18.5 Assumption 4: Independence

Assume that, for all i, j where i ≠ j:

ui , uj independent (4.28)

That is, where yi is relative to the regression line does not affect where yi+1 will
be relative to the line.

This is often not true for time-series data (e.g. if unemployment is high this
year, it will likely also be high next year):


4.18.6 Assumption 5: Sample assumptions


Assume that:

1. Sample variance of x is non-zero

2. There are more than 2 data points

4.18.7 Implication of assumptions


From Assumptions 1 & 2: the least squares estimators are unbiased, i.e.:
E[b1] = β1 ,  E[b2] = β2    (4.29)
From Assumptions 3 & 4:

We need to estimate σu 2 .
E[se 2 ] = σu 2 . Hence, replace σu 2 with se 2 .


Coefficient standard errors se(b1), se(b2): square roots of Var[b1], Var[b2] when
we replace σ_u² with s_e²

4.19 Hypothesis testing


4.19.1 Normality and t statistics
Assume ui is normally distributed, so b1 , b2 are normally distributed too. Al-
ternatively, rely on CLT.

Similarly to the univariate case, the t statistic for b2 is:

(b2 − β2) / se(b2) ∼ t_{n−2}    (4.30)

4.19.2 Significance tests


Generally, we have:
H0 : β2 = 0,  H1 : β2 ≠ 0    (4.31)
where rejecting H0 means X explains some variation in Y. We call this test the
coefficient significance test.

4.19.3 Confidence intervals


Confidence interval for b2 :
b2 ± t_{n−2,α/2} · se(b2)    (4.32)

4.20 Optimality properties


The estimators b1 , b2 are the best linear unbiased estimators:

• They are unbiased


• Variance approaches 0 as n → ∞, so consistent
• By Assumptions 1-4: b1 , b2 have smallest variance among all possible un-
biased estimators that are linear functions of y

We call this result the Gauss-Markov theorem: our OLS estimators are
BLUE (Best Linear Unbiased Estimators).


Week 5: Inference for bivariate regression

5.1 Robust Standard Errors


5.1.1 Relaxing the assumptions
Some results require Assumptions 3 and 4. If either assumption is wrong, the
variance of the estimators will also be wrong. Our default standard errors are
no longer applicable. To perform inference on b2 , we must use robust stan-
dard errors.

3 cases where robust standard errors must be used:


1. Robustness to heteroskedasticity
2. Robustness to autocorrelation
3. Robustness to clustering

5.1.2 Heteroskedasticity
Heteroskedasticity: Drop Assumption 3
• V ar[ui |xi ] is not constant at σu 2 for each i.
• The formula for se(b2 ) is incorrect as it depends on σu 2 , which no longer
exists.

Adjusting standard errors (NOTE: not examinable):


• White (1980)
• Even if Assumption 3 is false, E[ui |xi ] = 0
• Hence, estimate Var[ui |xi ] = E[ui²|xi ] using the individual ei² rather than the
common se²
Heteroskedasticity-robust standard errors
• Robust standard errors larger → t statistics smaller, p-values larger →
rejecting H0 harder
• Stata: regress y x, robust adjusts standard errors
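A sketch (like the adjustment above, not examinable) of White-style (HC0) robust standard errors computed by hand next to the default formula; the data are simulated with heteroskedastic errors purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)
u = rng.normal(0, 0.5 + 0.3 * x)        # error variance grows with x (heteroskedastic)
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(n), x])    # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
e = y - X @ b                           # residuals

XtX_inv = np.linalg.inv(X.T @ X)
k = X.shape[1]

# default standard errors: s_e^2 * (X'X)^{-1}
s2 = (e @ e) / (n - k)
se_default = np.sqrt(np.diag(s2 * XtX_inv))

# White (HC0) robust standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = (X * e[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(b, se_default, se_robust)
```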


5.1.3 Autocorrelation
Autocorrelation: Drop Assumption 4

• ui and uj not independent

• Very common for time-series data

Adjusting standard errors (NOTE: not examinable):

• Newey and West (1987)

• Estimate Cov[ui , uj ] by ei ej

• We only make this estimate for i, j close together: |i − j| < m

• For |i − j| ≥ m we maintain assumption that Cov[ui , uj ] = 0

Heteroskedasticity-and-autocorrelation-consistent HAC standard er-


rors

• If correct for autocorrelation, already correct for heteroskedasticity

• Stata: newey unemp infl, lag(m) where m is from |i − j| < m

• Choosing m (2 methods):

– m ≈ ∛T (cube root), where T is the sample size of the time-series data
– By observation (e.g. if corr[et , et−8 ] > 0.2 but corr[et , et−9 ] < 0.2,
choose m = 8)
– NOTE: m = 0 ⇒ heteroskedasticity-robust standard error

5.1.4 Clustering
Clustering: Special type of autocorrelation

• Panel data: correlation across time, but not individuals (i.e. ui,t correlated
with ui,t−1 , but not ui−1,t )

• Cross-section data: e.g. grades of students of same subject taught by


different lecturers

Cluster-robust standard errors

• Cluster-robust standard errors: df = G − 1 where G is number of clusters.


Why: each cluster only gives limited amount of information

• Stata: regress y x, vce(cluster variablename)


5.2 Prediction
5.2.1 Point forecasts
Forecast: given a data point x*, what is our best prediction of the corresponding y*?

Point forecast: use ŷ* = b1 + b2 x*

To build confidence intervals, we need the standard error of the forecast. How
to find depends on whether we want to forecast:
1. Conditional mean: β1 + β2 x∗
2. Actual value: β1 + β2 x∗ + u∗

5.2.2 Forecasting the conditional mean


Conditional mean: Conditional mean of y ∗ is E[y ∗ |x∗ ] = β1 + β2 x∗

Forecast variance becomes smaller when:


• Smaller σu 2 (less noise in population)
• Larger n (more data)
• Larger sx 2 (more varied data)
• Smaller (x∗ − x̄)2 (forecast point closer to sample mean)
Confidence intervals for conditional mean

Stata:
• Add x*, leave y* empty, regress y x
• Execute predict yhat; yhat is the point forecast ŷ*
• Find the standard error of yhat with predict [newvar], stdp, where stdp gives the standard
error of the prediction (of the conditional mean)
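A sketch of the conditional-mean forecast and its standard error for the bivariate model, using the textbook formula se = s_e·√(1/n + (x* − x̄)²/Σ(xᵢ − x̄)²), which captures all four factors listed above and should match what stdp reports in this simple case; the data are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9])   # made-up data
n = x.size

xbar = x.mean()
b2 = ((x - xbar) * (y - y.mean())).sum() / ((x - xbar) ** 2).sum()
b1 = y.mean() - b2 * xbar
e = y - (b1 + b2 * x)
se_reg = np.sqrt((e @ e) / (n - 2))          # standard error of the regression

x_star = 6.5
y_hat = b1 + b2 * x_star                     # point forecast

# standard error of the conditional-mean forecast (smaller near xbar, shrinks with n)
se_mean = se_reg * np.sqrt(1 / n + (x_star - xbar) ** 2 / ((x - xbar) ** 2).sum())

tcrit = stats.t.ppf(0.975, df=n - 2)
print(y_hat, (y_hat - tcrit * se_mean, y_hat + tcrit * se_mean))
```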

5.2.3 Forecasting the actual value


Actual value: Expected forecast for actual value of y∗ is E[y ∗ |x∗ ] + u∗


5.2.4 Comparison

Hourly wage example:


Week 6: Bivariate Regression and Transformations

6.1 Introduction
Data transformations: f (y) = β1 + β2 g(x) + u instead of y = β1 + β2 x + u

We must be careful with the interpretation of the results:

• Marginal effects: marginal effect of x on y

• Retransformation bias: transformations make estimations biased

6.1.1 Easiest transformation: Changing units


Changing units: a linear transformation (i.e. x → kx, y → cy)

Changing the units of x


Changing units of x: x ⇒ kx (e.g. k = 1000 for km → m)

y = β1 + β2 x + u ⇒ y = β1 + (β2/k)(kx) + u

Changing the units of y


Changing units of y: y ⇒ cy

y = β1 + β2 x + u ⇒ cy = cβ1 + cβ2 x + cu

6.2 Transformations of x
6.2.1 Dummy variables
Dummy variables introduction
Dummy variable: variable only taking the values 0 or 1 (denoted di )


Regression on a dummy variable

• b2 is an estimate for the difference in means between the 2 groups (1


and 0)

• Hence H0 : b2 = 0 is equivalent to H0 : µ1 = µ2

Example Do mean wages differ significantly between those who did and didn’t
graduate high school?

We transform x, which measures education in years:

Transformations of the regressor

Typically, we regress y on g(x), rather than transforming x:

6.2.2 Marginal effects


The marginal effect
Marginal effect: ∆ŷ/∆x

Model ŷ = b1 + b2 x: slope b2 = ∆ŷ/∆x; interpretation: expected change in y given a unit change in x

Model ŷ = b1 + b2 g(x): slope b2 = ∆ŷ/∆g(x); interpretation: e.g. expected wage difference between 2 groups

Recovering the marginal effect


Recover ∆ŷ/∆x from ∆ŷ/∆g(x):


Different marginal effects


Notice that the marginal effect is not constant (i.e. different for every x).
E.g. Marginal effect is 0 until you graduate high school
Hence we need a way to summarise the different marginal effects.

Summarising the marginal effect:


• Average Marginal Effect (AME): Average of marginal effects across all
individuals
i.e. (1/n) Σ_{i=1}^n ME(x_i)
• Marginal Effect at the Mean (MEM): M E(x̄)
• Marginal Effect at a Representative value (MER): M E(x∗ ), for some x∗
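A sketch of AME, MEM and MER for a linear-log model ŷ = b1 + b2 ln(x), where the marginal effect is ME(x) ≈ b2/x; the coefficients reuse the linear-log wage example below, and the education values are hypothetical:

```python
import numpy as np

# hypothetical linear-log wage equation: wage_hat = b1 + b2 * ln(educ)
b1, b2 = -7.46, 5.33                       # coefficients from the linear-log example in these notes
educ = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)   # hypothetical sample

me = b2 / educ                 # marginal effect of one more year, evaluated at each x_i

ame = me.mean()                # Average Marginal Effect
mem = b2 / educ.mean()         # Marginal Effect at the Mean
mer = b2 / 12.0                # Marginal Effect at a Representative value (x* = 12)
print(ame, mem, mer)
```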

6.2.3 Other transformations


Generic case
For ŷ = b1 + b2 g(x), b2 = ∆ŷ/∆g(x). The marginal effect is:

∆ŷ/∆x = b2 · ∆g(x)/∆x    (6.1)
g(x) can be anything, but it is usually:
1. Dummy indicator function
2. Natural logarithm

The linear-log model


Linear-log model: y_i = β1 + β2 ln(x_i) + u_i

• b2 = ∆y/∆ln(x)

Note that because ∆ln(x) ≈ ∆x/x, we have ∆y/∆ln(x) ≈ ∆y/(∆x/x)
i.e. b2 is approximately equal to the change in y given a relative change in x

• Rearranging the above, we also get ∆ŷ ≈ b2 · (∆x/x)

i.e. given a relative change in x, ŷ changes by b2 times that relative change

E.g. if x increases by 1% ⇒ ∆x/x = 0.01 ⇒ ŷ increases by b2 · 0.01


Marginal effects in the linear-log model


Marginal effect:

ME(g(x)) = b2 · ∆g(x)/∆x = b2 · ∆ln(x)/∆x ≈ b2 · (1/x)    (6.2)
NOTE: Effect of ∆x on y decreases as x increases

Example Graduating high school has bigger association with wages than PhD

Linear model: ŵage = −0.90 + 0.54 educ
(an extra year of education = an extra $0.54/hr)

Linear-log model: ŵage = −7.46 + 5.33 ln(educ)
(the 6th year of education raises the wage by 5.33/6 ≈ $0.89 per hour, the 18th year by 5.33/18 ≈ $0.30 per hour
- diminishing returns to education)

6.3 Transformations of y
6.3.1 Transforming y instead of x (Log-linear model)
We have:

• Population model: f(y) = β1 + β2 x + u

• Estimated regression: f̂(y) = b1 + b2 x

Clearly, b2 = ∆f̂(y)/∆x

Marginal effect:
∆ŷ/∆x = (∆ŷ/∆f̂(y)) · (∆f̂(y)/∆x) = b2 · ∆ŷ/∆f̂(y)    (6.3)

The log-linear model


Log-linear model: ln(ŷ) = b1 + b2 x


• Slope coefficient: b2 = ∆ln(ŷ)/∆x

• ∆ln(ŷ) ≈ ∆ŷ/ŷ ⇒ b2 ≈ (∆ŷ/ŷ)/∆x

• i.e. b2 is the predicted relative change in y given a unit change in x

• b2 is also called the semi-elasticity of y with respect to x

Marginal effects in the log-linear model


Marginal effect:

ME(ln(ŷ)) = b2 / (∆ln(ŷ)/∆ŷ) ≈ b2 / (1/ŷ) = b2 · ŷ    (6.4)

NOTE: Effect of ∆x on y increases as x increases

Example: l̂n(wage) = 0.58 + 0.08 educ

i.e. every additional year in school increases wages by approximately 8%

6.4 Transforming both x and y

6.4.1 Log-log model


Log-log model: ln(ŷ) = b1 + b2 ln(x)


• Slope coefficient: b2 = ∆ln(ŷ)/∆ln(x) ≈ (∆ŷ/ŷ)/(∆x/x)

• The slope coefficient now measures the relative change in y for every relative change
in x

• b2 is the elasticity of y with respect to x

6.4.2 Marginal effects in the log-log model


Marginal effect:
ME = ∆ŷ/∆x ≈ b2 · (ŷ/x)    (6.5)
NOTE: Effect of ∆x on ŷ is again non-constant

6.5 Retransformation bias


Retransformation bias: Obtaining ŷ back from ln(ŷ) is not as simple as
exp(ln(ŷ)), which is a biased estimate of E[y|x].

6.5.1 Avoiding retransformation bias


Given usual assumptions and normality of errors, the unbiased estimator
of E[y|x] is:
ŷ = exp(σ_u²/2) · exp(ln(ŷ))    (6.6)
But we don’t know σu 2 , so we replace it with se 2

WARNING: This formula DOES NOT WORK if assumptions fail (especially


heteroskedasticity) or errors are not normally distributed
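A simulation sketch of equation (6.6): with normal, homoskedastic errors, exp(l̂n(y)) underestimates E[y|x] while exp(s_e²/2)·exp(l̂n(y)) is roughly unbiased (everything below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.uniform(0, 5, n)
sigma_u = 0.8
u = rng.normal(0, sigma_u, n)
y = np.exp(1.0 + 0.5 * x + u)              # true model: ln(y) = 1 + 0.5x + u

# fit the log-linear model
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
e = np.log(y) - X @ b
s2e = (e @ e) / (n - 2)

x_star = 3.0
lnyhat = b[0] + b[1] * x_star
naive = np.exp(lnyhat)                          # biased estimate of E[y|x*]
corrected = np.exp(s2e / 2) * np.exp(lnyhat)    # eq. (6.6) with s_e^2 replacing sigma_u^2
true_mean = np.exp(1.0 + 0.5 * x_star + sigma_u**2 / 2)   # E[y|x*] under lognormality
print(naive, corrected, true_mean)
```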

6.6 Choosing a model


Mathematically: No log transformation should be used if a variable has negative
values
Economically:


Statistically: Look at data on scatter plot


• Long right tail? (Right-skewed?)
Statistically: use the coefficient of determination, R2
• higher R2 means model fits data better

• R² is valid for comparing models with the same y but different x

• R² is invalid for comparing models with different y
– Because R² measures how well the model fits the left-hand-side variable

6.7 Summary

Why log transformations:


1. Economic data is likely to be right-skewed

2. We are usually interested in elasticity


Week 8: Multivariate Data

NOTE: There was no additional content in Week 7 due to the mid-semester


test.

7.1 Problem with bivariate regression


7.1.1 Example: Death rate in car accidents, speed
Bivariate regression model:
Death rate in car accidents = β1 + β2 speed + ui (7.1)
Problem: other factors that might affect death rates are absorbed to ui

Example: Safety is another factor as


Cov(Speed, Saf ety) > 0 (7.2)
Better technology = better speed and better safety

Then:
u = β3 · Safety + other random errors, with β3 < 0    (7.3)
Implication: b2 will be
• Biased
• Inconsistent
– b2 will actually converge to:
β2 + β3 · Cov(Speed, Safety) / Var(Speed)    (7.4)
– If β2 > 0, b2 can even be negative if Cov(Speed, Safety) is strong enough,
since β3 < 0

7.1.2 Omitted Variable Bias


Omitted Variable Bias: The omission of a variable Z (e.g. safety), which is
correlated with both the dependent variable (e.g. death rate) and regressor
(e.g. speed), will make OLS estimators biased and inconsistent.
• Can occur in multivariate regression, most common in bivariate


7.2 Multivariate analysis: The general plan


Steps in multivariate analysis:

• Data description

– Plots
– Graphs

• Model: Multivariate regression

– Model and parameters of interest


– Estimation
– Assumptions and properties
– Inference
– Interpretation
– Prediction
– Evaluation and Comparison
– Important Special Cases
– Misspecification of the model

7.3 Data Description


7.3.1 Plots
If we have 3 variables, we can make a three-way scatter plot:


7.3.2 Graphs
If we have more than 3 variables, we need other methods:

• 3D scatter plot with colour

• Animation for time-series data

• Scatter plot table

Example: Scatter plot table - plots each pairing of variables

7.4 Multivariate Regression


7.4.1 Model and parameters of interest
We have:

• 1 dependent variable of interest (Y )

• Several independent variables (X)

Notation for k random variables:

• Y : dependent variable, outcome, LHS variable

• X2 , X3 , ..., Xk : covariates, explanatory variables, independent variables,


RHS variables, regressors

– NOTE: There is no X1 because β1 is the intercept; it does not multiply any regressor


So our model is now:

Y = β1 + β2 X2 + β3 X3 + ... + βk Xk + u (7.5)

Our parameters of interest are β1 , β2 , ..., βk

7.4.2 Estimation
Our line which fits the data best is:

y = b1 + b2 x2 + b3 x3 + ... + bk xk (7.6)

For each individual i, the realisations of X2 , X3 , ..., Xk are:

x2i , x3i , ..., xki (7.7)

The prediction of y given an individual i is:

ŷi = b1 + b2 x2i + ... + bk xki (7.8)

The OLS estimator for β1 , β2 , ..., βk is the set of values for b1 , b2 , ..., bk that
solves:
min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n (y_i − ŷ_i)²    (7.9)

or equivalently

min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n (y_i − (b_1 + b_2 x_{2i} + ... + b_k x_{ki}))²    (7.10)

or equivalently

min_{b1,b2,...,bk} (1/n) Σ_{i=1}^n e_i²    (7.11)
To minimise, we need to find k different coefficients: b1 , b2 , ..., bk

We solve a system of k linear equations:


Σ_{i=1}^n e_i = 0
Σ_{i=1}^n x_{ji} e_i = 0  for j = 2, ..., k

Implications:
• The residuals sum to 0
• Each regressor is orthogonal to the residual

Condition on the data for this system to have a unique solution: we need
adequate variations in the data on all the x values (more on this in the next
lecture)
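A sketch solving this system numerically: with a design matrix X whose first column is ones, the equations above are the normal equations X'Xb = X'y, and the fitted residuals satisfy the two orthogonality properties listed (all data below are simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 - 0.5 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])       # intercept plus k-1 regressors
b = np.linalg.solve(X.T @ X, X.T @ y)           # normal equations X'X b = X'y
e = y - X @ b                                   # residuals

print(b)
print(e.sum())                    # residuals sum to 0 (first equation of the system)
print(X[:, 1] @ e, X[:, 2] @ e)   # each regressor is orthogonal to the residuals
```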


7.4.3 Interpretation
Interpretation for b2 : the partial effect on the predicted value of y when X2
changes by one unit, holding X3 , ..., Xk fixed.

Equivalently, if we have 2 individuals having the same X3 , X4 , ..., Xk , we ex-


pect their difference in predicted Y to be b2 when their X2 differs by 1 unit.

i.e. The effect of x2 ceteris paribus

NOTE: Generally, the value of bj is different from the slope of a bivariate regression
of y on xj

Changing more than one variable simultaneously


To find the effect of a change in multiple variables, just use the regression line.

Example:

The aggregate effect of an individual staying at a firm for an extra year (i.e.
both experience and tenure increase by 1) is:

∆l̂n(Earnings) = 0.029·∆Experience + 0.011·∆Tenure
              = 0.029 + 0.011
              = 0.040
              ≈ 4% increase in earnings

7.4.4 Fitted values and residuals


Fitted or predicted value:
yˆi = b1 + b2 x2i + b3 x3i + ... + bk xki (7.12)
Residual:
ei = yi − yˆi (7.13)


Properties of fitted values and residuals:


1. Sample average of residuals = 0
2. Sample covariance between independent variables and residuals = 0 ⇒
Sample covariance between fitted values and residuals = 0
3. The point (x¯2 , x¯3 , ..., x¯k , ȳ) is always on the regression line

7.4.5 Goodness of fit


Note: All these measures are inappropriate for regressions without an intercept
(more about this when we talk about dummy variables)

Methods of measuring goodness of fit:


• Standard error of the regression
• R2
• Adjusted R2
Standard error of the regression: change in degrees of freedom
s_e = √[ (1/(n−k)) Σ_{i=1}^n (y_i − ŷ_i)² ]    (7.14)

R2 : no change from bivariate


TSS = Σ_{i=1}^n (y_i − ȳ)²
ESS = Σ_{i=1}^n (ŷ_i − ȳ)²
RSS = Σ_{i=1}^n (y_i − ŷ_i)²
TSS = ESS + RSS
R² = ESS/TSS = 1 − RSS/TSS,    R² ∈ [0, 1]
Problem: R2 always increases when we add more regressors to the model ⇒ we
have to use adjusted R2

Adjusted R² (R̄²):
R̄² = 1 − ((n − 1)/(n − k)) · (RSS/TSS),    R̄² ∈ (−∞, 1]

R¯2 is the balance between degrees of freedom and prediction accuracy.


This refers to the fact that more variables means:
1. Higher R2 , higher prediction accuracy


2. Fewer df , higher se , lower inference accuracy/reliability


We also have the following properties (This will make sense when we cover F
tests):
• Adding a regressor increases R̄² ⇐⇒ its |t| > 1
• Adding a group of regressors increases R̄² ⇐⇒ their joint F > 1
Example:

7.4.6 STATA Output


Sample STATA output:


Week 9: Inference for Multiple Regression Models

8.1 Assumptions
8.1.1 Data assumptions
Conditions needed on data to have a unique solution b1 , b2 , ..., bk :
1. Strictly more than k observations (so n − k > 0)
2. Adequate variation on the regressors - no perfect collinearity
Adequate variation: no regressor in the model can be expressed as an exact
linear combination of the other regressors

Example 1: Regress earnings on age, gender Assume the true relation


to be:

Suppose our dataset only includes males, so di = 1 for all i. i.e. there is
no variation in di in our sample.

This means we can write:

Earnings = 10 + 2 ∗ Age + D
= 10 + 2 ∗ Age + 1 as D = 1
= 11 + 2 ∗ Age
= 11 ∗ D + 2 ∗ Age as D = 1
= (11 − α) ∗ D + α ∗ D + 2 ∗ Age
= (11 − α) ∗ D + α + 2 ∗ Age as D = 1

So, now we have:

Earnings = (11 − α) ∗ D + α + 2 ∗ Age (8.1)


The best fitting line to this model will have:

• Intercept coefficient = α

• Slope coefficient on age = 2

• Slope coefficient on D = (11 − α)

for any α ∈ R. i.e. there are infinite solutions

Example 2: Regress earnings on age, school, experience Model:

Earnings = β1 + β2 Age + β3 School + β4 Experience + u (8.2)

True relation:

Earnings = 10 + Age + School + 2 · Experience + u (8.3)

Now assume that everyone in our dataset enters school at age 6 and works
as soon as they leave school. Then, we get the linear dependence relationship
(which we call perfectly collinear regressors):

Experience = Age − School − 6 (8.4)

Plugging this into our model, we get

Earnings = 10 + Age + School + 2 · (Age − School − 6) + u = −2 + 3 · Age − School + u

Which means the variable Experience is irrelevant, thus giving infinitely many
solutions.

Multicollinearity
Perfectly collinear regressors: regressors have a linear relationship

Multicollinearity: one or more of the regressors are very close to being an


exact linear combination of the other regressors. What this means is the regres-
sors likely measure the same things, incorporating the same information.

Implications of multicollinearity:

• Good

– We do get unique solutions


– STATA runs fine

• Bad: The estimated coefficients are very

– Imprecise: Large standard errors = inference less reliable


– Unstable: adding/deleting observations drastically changes estimates


8.2 Properties of the estimators


Properties of the residuals:

• The residuals sum to 0

• Each regressor is orthogonal (uncorrelated) to the residuals


• Σ_{i=1}^n ŷ_i = Σ_{i=1}^n y_i

• Each yˆi is orthogonal (uncorrelated) to the residuals

Example: Regress house price on bedrooms, size, bathrooms, lot size,


age, months old Model:

Price = β1 + β2·Bedrooms + β3·Size + β4·Bathrooms + β5·LotSize + β6·Age + β7·MonthsOld + u

We can see the aforementioned properties in STATA output:


8.3 Inference Part 1


8.3.1 Assumptions
Data assumptions
Assumptions: Ensures OLS coefficients are computable
1. Sample size (n) greater than number of regressors (k): n > k
2. Regressors are not perfectly collinear with each other

Population assumptions
Population assumptions: similar to bivariate assumptions
1. Linear model:

Yi = β1 + β2 X2i + ... + βk Xki + ui for all i    (8.5)

2. Error has mean zero, unrelated to regressors: (Endogeneity)

E[ui |X2i ...Xki ] = 0 (8.6)

3. Homoskedasticity: (Heteroskedasticity)

V ar[ui |X2i ...Xki ] = σu 2 (8.7)

4. Independent errors: (Autocorrelation)

ui , uj independent for all i ≠ j    (8.8)

Interpretation
• Assumptions 1 & 2: ensures estimators are unbiased and consistent
• Assumptions 3 & 4: determines precision and distribution of estimators


8.3.2 Properties of OLS estimators


Under Assumptions 1-4, we get the following properties:

1. OLS estimates are unbiased for population parameters:

E[bj ] = βj (8.9)

2. Variance of the OLS estimator bj is

Var[bj] = σ_{bj}² = σ_u² / Σ_i x̃_{ji}²    (8.10)

where x̃_{ji} is the residual from regressing x_{ji} on an intercept and all regressors other than itself,
i.e. xj = β1 + β2 x2 + ... + βj−1 xj−1 + βj+1 xj+1 + ... + βk xk

Example: j = 3. Regress x3i on x2i , x4i , ..., xki (with intercept) using
the same data set. The residuals of this regression are x̃3i (a numerical sketch of this appears after the list below).

Comments on variance:

• V ar[bj ] smaller when V ar[ui |X2i ...Xki ] = σu 2 smaller

• V ar[bj ] smaller the less xj is explained by other regressors (i.e. less


multicollinearity)

• As n → ∞, V ar[bj ] → 0 (i.e. the estimators are consistent)

3. We do not know σbj 2 as we don’t know σu 2 . So, we estimate σu with se :

se(bj) = s_e / √(Σ_i x̃_{ji}²)    (8.11)

The t-statistic is obviously

t = (bj − βj) / se(bj)    (8.12)

with the distribution being Tn−k as we have df = n−k. The approximation


is exact if:

• n→∞

• Errors are normally distributed


4. If:

• n→∞
• Errors are normally distributed

then
bj ∼ N (βj , σbj 2 ) (8.13)

5. OLS estimators are BLUE:

• Best Linear Unbiased Estimator


• Best = minimum variance

If errors are normally distributed, the estimators are BUE
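As promised above, a sketch of property 2: the residualised regressor x̃_j from regressing x_j on the other regressors reproduces both the multivariate slope b_j and its standard error s_e/√(Σ x̃_{ji}²) from equations (8.10)-(8.11) (data simulated; this partialling-out idea is sometimes called the Frisch-Waugh result):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)              # correlated regressors
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x2, x3])
k = X.shape[1]
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s_e = np.sqrt((e @ e) / (n - k))

# usual standard errors from s_e^2 (X'X)^{-1}
se_full = np.sqrt(np.diag(s_e**2 * np.linalg.inv(X.T @ X)))

# residualise x3 on an intercept and x2 (the "x tilde" of equation (8.10))
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.solve(Z.T @ Z, Z.T @ x3)
x3_tilde = x3 - Z @ g

b3_fwl = (x3_tilde @ y) / (x3_tilde @ x3_tilde)   # same as b[2]
se_b3 = s_e / np.sqrt((x3_tilde ** 2).sum())      # equation (8.11)
print(b[2], b3_fwl, se_full[2], se_b3)
```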

8.3.3 Hypothesis testing: single parameter


t statistic:
bj − β j
T = (8.14)
se(bj )
Note that se(bj ) → 0 as n → ∞
Degrees of freedom:
df = n − k (8.15)

Hypotheses interpretation: βj is the relationship between Xj and Y holding the
values of the other regressors fixed

• H0 : βj = 0 — Xj has no partial effect on the expected value of Y after


controlling for all other explanatory variables

• Ha : βj ≠ 0 — Xj has a partial effect on the expected value of Y after


controlling for all other explanatory variables


NOTE: We assume that all other regressors are non-zero in each hypothesis.
Interpretation of the test: ”Given our other regressors in our population model,
do we still need this regressor?”

e.g. X1 and X2 likely measure similar things (high correlation, multicollinearity)
⇒ both t-statistics likely low. But that doesn't mean we can drop both! Given
X1 , we don't need X2 , and vice versa. We need to check the joint significance of
the regressors.

So what we must do is check the correlation table for multicollinearity before


concluding that a regressor can be omitted (it might share the same information
as another regressor).

T-test:

1. Setting up hypotheses:

H0 : β j = 0
Ha : βj ≠ 0

2. Pick a significance level α

3. Calculate the t-statistic:


bj − β j
t= (8.16)
se(bj )

4. Evaluating the t-statistic

• p-value: probability of observing a t-statistic at least as extreme as


the one calculated

p − value = P r[|Tn−k | ≥ |t|] (8.17)

• Critical value:
c = tn−k,α/2 (8.18)

5. Conclusion

• Reject: Xj is statistically significant at the α% level


• Conclude using the economic context of the question

6. STATA Commands

• Critical value tn−k,α/2


display invttail (<n-k>, <a/2>)
• p-value P r[|Tn−k | ≥ a]
display ttail (<n-k>, <a>)

Single-tailed t-test:


1. Setting up hypotheses:

H0 : β j ≥ 0
Ha : β j < 0

or vice versa (H0 is what we want to reject, Ha is our claim)


2. Pick a significance level α
3. Calculate t-statistic

4. Evaluate the t-statistic


• p-value: probability of observing t-statistic at least as large/small as
the t-statistic we calculated

p = P r[Tn−k ≥ t] or P r[Tn−k ≤ t] (8.19)

The direction of ≤ / ≥ matches the direction of the inequality in Ha


5. Conclusion

Confidence intervals:
• x% CI:
          x% CI = bj ± tn−k,α/2 · se(bj ),   where α = 1 − x           (8.20)

• Interpretation

– We are x% confident that the CI covers the true population parameter


βj
– If random samples were obtained over and over again, with the CI
computed each time, then x% of the CIs will contain the population
parameter βj

• If df > 120, estimate using t∞
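A minimal STATA sketch of computing the 95% CI by hand (hypothetical regressor x2); regress already reports this interval, so the lines below only reproduce it.

   regress y x2 x3
   display _b[x2] - invttail(e(df_r), 0.025)*_se[x2]    // lower bound of the 95% CI
   display _b[x2] + invttail(e(df_r), 0.025)*_se[x2]    // upper bound of the 95% CI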

Week 10: Inference for Multiple Regression Models

9.1 Hypothesis Testing: One Linear Combination of Parameters
Consider a hypothesis test involving two parameters (linear restriction):

9.1.1 Example: Standard earnings (wages)


Model:
ln(earning) = β1 + β2 S + β3 exper + u (9.1)

Question: Does an extra year of formal education (S) have the same effect
on the natural logarithm of earnings (ln(earning)) as an extra year of general
workforce experience (exper)?
H0 : β2 = β3
Ha : β2 ≠ β3

Which is equivalent to:

H0 : β2 − β3 = 0
Ha : β2 − β3 ≠ 0

t-statistic:

                  t = (b2 − b3 ) / se(b2 − b3 )                        (9.2)

9.1.2 Calculating se(b2 − b3 )


Note that:

se(b2 − b3 ) = √(Var(b2 − b3 ))
Var(b2 − b3 ) = Var(b2 ) + Var(b3 ) − 2 Cov(b2 , b3 )


We will rewrite the model such that the STATA output will directly provide
se(b2 − b3 ). Define:
θ = β2 − β3 (9.3)

Then our model becomes:

ln(earning) = β1 + β2 S + β3 exper + u
= β1 + (θ + β3 )S + β3 exper + u
= β1 + θS + β3 (S + exper) + u
= β1 + θS + β3 X4 + u Define X4 = S + exper

STATA now reports θ̂ and se(θ̂) directly as the coefficient and standard error on S in this rewritten model.

Our test is now:

H0 : θ = 0
Ha : θ ≠ 0

Our t-statistic:

                  t = (θ̂ − 0) / se(θ̂)                                 (9.4)
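A hedged STATA sketch of both routes, assuming the variables are named lnearn, S and exper (lnearn is a guessed name for the log-earnings variable).

   * route 1: reparameterise so the coefficient on S is theta = beta2 - beta3
   gen X4 = S + exper
   regress lnearn S X4          // coefficient and se on S are theta-hat and se(theta-hat)

   * route 2: test the restriction on the original model directly
   regress lnearn S exper
   test S = exper               // F-test of H0: beta2 = beta3 (equals the squared t-test)
   lincom S - exper             // estimate, se and CI for beta2 - beta3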

9.2 Hypothesis testing: More than One Linear Restriction - The F-Test
We now consider whether a group of variables has an effect on the dependent
variable.


We are now doing joint hypothesis tests: more than one restriction on the
parameters.

9.2.1 Example: Parents’ education and child’s birth weight


Variables:

• bwght: birth weight

• cigs: average no. of cigs mother smoked per day during pregnancy

• parity: birth order

• faminc: family income

• motheduc: years of education (mother)

• fatheduc: years of education (father)

Our model:

bwght = β1 + β2 cigs + β3 parity + β4 faminc + β5 motheduc + β6 fatheduc + u    (9.5)

Our question: should motheduc and fatheduc be excluded from the model
after other variables have been controlled for (2 exclusion restrictions)?

H0 : β5 = 0, β6 = 0
Ha : H0 is false

NOTE: Ha holds if either β5 ≠ 0 or β6 ≠ 0. It is not appropriate to test this joint


hypothesis by considering 2 separate t-tests. Consider the case where we do this
at a 5% significance level:

TEST 1
H0 : β5 = 0
Ha : β5 ≠ 0
REJECT IF |b5 /se(b5 )| ≥ tn−k,0.025

TEST 2
H0 : β6 = 0
Ha : β6 ≠ 0
REJECT IF |b6 /se(b6 )| ≥ tn−k,0.025

We reject the joint hypothesis if either test rejects, so the probability of re-
jecting (when H0 is true) is:

       Pr[|t5 | ≥ tn−k,0.025 or |t6 | ≥ tn−k,0.025 ]                    (9.6)


This probability depends on Corr(t5 , t6 ), which we do not know. So, we can-


not know the probability of Type I error.

Consider the possible probabilities:

IF INDEPENDENT
1 − Pr[|t5 | < tn−k,0.025 and |t6 | < tn−k,0.025 ]
= 1 − Pr[|t5 | < tn−k,0.025 ] · Pr[|t6 | < tn−k,0.025 ]
= 1 − 95% · 95%
= 9.75%

IF PERFECTLY CORRELATED
Pr = 5%

The F-Test takes the correlation structure into account.

9.2.2 The F-test


F-test:
• F-test is for two-sided alternatives only
• Only if the errors are homoskedastic can the F-statistic be computed from
  the RSS/R² of the models estimated with, and then without, the restrictions
  imposed

Restricted/Unrestricted models
Unrestricted model (UR): Complete/original model

bwght = β1 + β2 cigs + β3 parity + β4 f aminc + β5 motheduc + β6 f atheduc + u


(9.7)

Restricted model (R): Imposing restrictions in null hypothesis

bwght = β1 + β2 cigs + β3 parity + β4 f aminc + u (9.8)

NOTE:
• RSS of Model R ≥ RSS of Model UR
• R2 of Model UR ≥ R2 of Model R

The F-statistic
F-statistic:

       F = [(RSSR − RSSUR )/q] / [RSSUR /(n − k)]                       (9.9)

where:


• q is the number of restrictions in H0

As TSSR = TSSUR , this is equivalent to:

       F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]                    (9.10)

TIP: To keep the numerator positive, always subtract the smaller value from the
larger one (RSSR ≥ RSSUR and R²UR ≥ R²R )

When errors are heteroskedastic, these formulas are no longer valid

The F-statistic is approximately F (q, n − k) distributed. The distribution is
exact when the errors are normally distributed, and the approximation improves
as n → ∞.

F-distribution can be denoted F (v1 , v2 ) or Fv1 ,v2 where v1 , v2 are the first (q)
and second (n − k) degrees of freedom


In our example:

• R²UR = 0.0387, R²R = 0.0364

• So F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)] = 1.42

Critical value of the F-test


Reject if F > critical value. The critical value comes from the (1 − α) percentile
of the F (q, n − k) distribution.

Finding the critical value on STATA:

display invFtail (<q>,<n-k>,<a>)

F-distribution depends on 2 parameters ⇒ we have a table for each α. Use


F (q, ∞) for n − k > 120.

In our example:

• q = 2, n − k = 1191 − 6 = 1185, α = 5%

• Critical value:

display invFtail(2, 1185, 0.05)


3.0033184

• F < cv, so we do not reject the null hypothesis

• Thus, motheduc and fatheduc do not have a statistically significant joint
  effect on bwght after the other variables have been controlled for
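A hedged STATA sketch of this joint test using the built-in test command, assuming the variable names from the example.

   regress bwght cigs parity faminc motheduc fatheduc
   test motheduc fatheduc                  // F-test of H0: beta5 = beta6 = 0 (reports F and p-value)
   display invFtail(2, e(df_r), 0.05)      // 5% critical value of F(q, n-k)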


9.2.3 F-test General Procedure


Model:
Y = β1 + β2 X2 + ... + βk Xk + u (9.11)

Hypotheses:

H0 : the q restrictions on the β’s hold jointly
Ha : H0 false

Models:

Y = β1 + β2 X2 + ... + βk Xk + u (UR)
Model with q restrictions (R)

F-statistic:

   F = [(RSSR − RSSUR )/q] / [RSSUR /(n − k)]
     = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]   if TSSUR = TSSR      (9.12)

Critical value:

(1 − α) quantile of the F (q, n − k) distribution (9.13)

Conclude:

• Reject: if F > Critical value

• Do not reject: otherwise
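A minimal STATA sketch of computing the F-statistic by hand from the two RSS values, assuming hypothetical regressors x2–x5 and q = 2 exclusion restrictions; e(rss) and e(df_r) are stored after regress.

   regress y x2 x3 x4 x5               // unrestricted model
   scalar rss_ur = e(rss)
   scalar df_ur  = e(df_r)             // n - k
   regress y x2 x3                     // restricted model (x4, x5 excluded)
   scalar rss_r = e(rss)
   display ((rss_r - rss_ur)/2) / (rss_ur/df_ur)    // F-statistic
   display invFtail(2, df_ur, 0.05)                 // 5% critical value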

9.2.4 Joint significance tests


Joint significance tests: Testing whether q explanatory variables have co-
efficients equal to 0

H0 : βk−q+1 = 0, ..., βk = 0
Ha : H0 is false

Interpretation:

• H0 rejected: xk−q+1 , ..., xk are jointly statistically significant

• H0 not rejected: xk−q+1 , ..., xk are jointly statistically insignificant


(justifies dropping them from the model)

NOTE: It is possible for a group of variables to be individually insignificant
but jointly significant


9.2.5 Special case: Overall significance


Overall significance test: tests whether all of the k − 1 explanatory variables are
jointly significant

H0 : β2 = 0, .., βk = 0
Ha : H0 false

The F-statistic and p-value of this test are reported in the standard STATA regression output.

Models:

Y = β1 + β2 X2 + ... + βk Xk + u (UR)
Y = β1 + u (R)

F-statistic: R²R = 0 and q = k − 1 for overall significance tests, so

   F = [(R²UR − R²R )/q] / [(1 − R²UR )/(n − k)]
     = [R²UR /(k − 1)] / [(1 − R²UR )/(n − k)]                          (9.14)

9.2.6 Special case: Testing one restriction


An F-test is equivalent to the t-test when only one restriction is tested, e.g.:

H0 : βj = 0
Ha : βj ≠ 0

In this case:
• F -statistic = (t-statistic)²
• Critical value of the F -test = (critical value of the t-test)²
• Pr[|Tn−k | > |t|] = Pr[F1,n−k > t² ] = Pr[F1,n−k > f ], i.e. the p-values are the same

Week 11: Data Transformation

10.1 Data transformations


We require:
      The dependent variable (as entered in the regression) is linear in the regressors
Note that this is not equivalent to requiring
      Y is linear in the original explanatory variables
So what we can have is
      A transformation of Y is linear in transformations of the explanatory
      variables
Multivariate transformations:
• Natural log
• Polynomials of regressors
• Dummy variables
• Interaction terms

10.2 Natural log


Take logs when:
• The rate of change of Y is linear in the level of Xj , holding all other
regressors constant
– Regress ln(Y ) on Xj and other regressors
• The level of Y is linear in the rate of change of Xj , holding all other
regressors constant
– Regress Y on ln(Xj ) and other regressors
• The rate of change of Y is linear in the rate of change of Xj , holding
all other regressors constant
– Regress ln(Y ) on ln(Xj ) and other regressors


10.2.1 Example: Cobb-Douglas Production Function


Cobb-Douglas production function:

Y = AK α Lβ (10.1)

Suppose we want estimates of A, α, β. We make the following transformation


to a linear model:
ln(Y ) = ln(A) + αln(K) + βln(L) (10.2)
We regress ln(Y ) on ln(K) and ln(L). Our fitted regression is:

      ln(Y )ˆ = ln(A)ˆ + α̂ ln(K) + β̂ ln(L)                            (10.3)

To get an unbiased predicted value for Y , we must account for retransforma-


tion bias and use
      Ŷ = exp(ln(Y )ˆ) · exp(se² /2)                                   (10.4)
NOTE: Only valid if homoskedastic or normal errors
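A minimal STATA sketch, assuming variables Y, K, L exist in the dataset; e(rmse) is the regression's se, used here for the retransformation correction.

   gen lnY = ln(Y)
   gen lnK = ln(K)
   gen lnL = ln(L)
   regress lnY lnK lnL                           // alpha-hat, beta-hat are the slopes; exp(_b[_cons]) estimates A
   predict lnY_hat, xb                           // fitted values of ln(Y)
   gen Y_hat = exp(lnY_hat)*exp((e(rmse)^2)/2)   // predicted Y with the retransformation correction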

10.3 Polynomial models


Polynomial models: Fit non-linear relations

10.3.1 Quadratic model


Simple polynomial model: Quadratic model

Y = β1 + β2 X + β3 X 2 + u (10.5)

Here, we regress Y on X and X 2 .


NOTE: Y is not linear in X, Y is linear in X and X 2 .

Normally, when we regress Y on X, we are assuming E[Y |X] changes in the


same
• Direction
• Amount
as X increases by 1 unit.

The quadratic model assumes

E[Y |X] = β1 + β2 X + β3 X 2 (10.6)

This is appropriate when:


• Y increases with X, then Y decreases with X
• Y decreases with X, then Y increases with X
• Y changes slowly when close to a value of X, but Y changes fast when
far from that value of X


NOTE: Log models account for the case where

Y changes fast when close to a value of X, but Y changes slowly


when far from that value of X

Interpretation
In the quadratic model
Y = β1 + β2 X + β3 X 2 (10.7)
we do not interpret the individual coefficients as partial effects. This is
because X and X² are necessarily dependent (one is a function of the other). Instead, we look at the:

Expected change in Y when X changes by 1 unit and hence both X
and X² change at the same time

NOTE:

Y changes by a different amount depending on the initial value


of X!

In fact, this is true for all polynomial models.

Examples We have the quadratic model

Y = β1 + β2 X + β3 X 2 (10.8)

X changes from 1 to 2: The predicted change in Y is

∆Y = (β1 + β2 2 + β3 22 ) − (β1 + β2 1 + β3 12 )
= β2 + 3β3


X changes from 10 to 11: The predicted change in Y is

∆Y = (β1 + β2 11 + β3 112 ) − (β1 + β2 10 + β3 102 )


= β2 + 21β3

Marginal Effect
Marginal effect: Predicted change ∆Y when X changes by a very small ∆X
      ME = ∆Y / ∆X                                                     (10.9)
This also depends on the value of X. We can interpret ME as the slope of
the E[Y |X] curve at X (derivative).

In the quadratic model:


M E = β2 + 2β3 X (10.10)
which is just the derivative of the quadratic model.

NOTE: In a quadratic model, AME (average marginal effect) = MEM (marginal effect at the mean), since ME is linear in X

Example - Regress earnings on education



Estimated model:
earningsˆ = 29252.89 − 3830.641 · education + 439.5283 · education²    (10.11)

Education increases from 0 to 1: Predicted change in earnings

∆earnings = (−3830.641 · 1 + 439.5283 · 12 ) − (−3830.641 · 0 + 439.5283 · 02 )


= −3830.641 · 1 + 439.5283 · 12
= −3391.11

Education increases from 10 to 11: Predicted change in earnings

∆earnings = (−3830.641 · 11 + 439.5283 · 112 ) − (−3830.641 · 10 + 439.5283 · 102 )


= −3830.641 · 1 + 439.5283 · 21
= 5399.49
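A minimal STATA sketch of this example, assuming the variables earnings and education; the last lines reproduce the predicted changes and the marginal effect.

   gen educ2 = education^2
   regress earnings education educ2
   display _b[education] + _b[educ2]*(1^2 - 0^2)      // predicted change: education 0 -> 1
   display _b[education] + _b[educ2]*(11^2 - 10^2)    // predicted change: education 10 -> 11
   display _b[education] + 2*_b[educ2]*10             // marginal effect at education = 10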

10.3.2 Cubic Model


Cubic model:
Y = β1 + β2 X + β3 X 2 + β4 X 3 + u (10.12)
The cubic model assumes that

E[Y |X] = β1 + β2 X + β3 X 2 + β4 X 3 (10.13)

Quadratic useful for capturing the relations where:


• Y changes slowly around a particular value of X, but changes fast (in
DIFFERENT directions) when far from this value of X
• ME changes sign ONCE
Cubic useful for capturing the relations where:
• Y changes slowly around a particular value of X, but changes fast (in the
  SAME direction) when far from this value of X
• ME changes sign TWICE


10.3.3 General case


We can include higher powers in our polynomial:

Y = β1 + β2 X + β3 X² + ... + βp+1 X^p + u                             (10.14)

Think about whether assuming the following makes sense

E[Y |X] = β1 + β2 X + β3 X² + ... + βp+1 X^p                           (10.15)

NOTE:
• If X p is included, it makes more sense to include all X m for m ≤ p
• Including many powers can be considered a nonparametric way of esti-
mating E[Y |X]

• The F-test can help us decide how many powers to use (generally overes-
timates)
We can include other regressors in our polynomial model:

E[Y |X, X2 , ..., Xk ] = β1 + β2 X + β3 X² + ... + βp+1 X^p + βp+2 X2 + ... + βp+k Xk    (10.16)

Polynomial models are still multiple regression models. They just use
transformed data (similar to linear restriction transformations).

Week 12: Data Transformations Part II

11.1 Dummy variables


Dummy variable: incorporates binary data into regression models - include
dummy variables as explanatory variables

Example Consider the following simple model of wage determination

wage = β1 + δ1 f emale + β2 educ + u (11.1)

δ1 is used as the coefficient for the f emale variable to denote that it is a dummy
variable:
      femalei = 1 if individual i is female, 0 if individual i is male           (11.2)

11.1.1 Interpretation
We have:

δ1 = E[wage|f emale = 1, educ] − E[wage|f emale = 0, educ] (11.3)

We can interpret δ1 as the difference in mean wages between females and


males, at the same level of education.

So our model is:

      wage = β1 + δ1 female + β2 educ,  which gives                              (11.4)
             β1 + δ1 + β2 educ   for females
             β1 + β2 educ        for males

δ1 represents an intercept shift between the regression lines for males and
females


11.1.2 Dummy variables and perfect collinearity


Note that we cannot include a dummy variable for both male and females due
to perfect collinearity.

Consider the model

wage = β1 + δ1 f emale + δ2 male + β2 educ + u (11.5)

where
      malei = 1 if individual i is male, 0 if individual i is female             (11.6)
Then, we have
f emalei + malei = 1 (11.7)
for all individuals i ⇒ perfectly collinear.

We can only use either of these two models:

wage = β1 + δ1 f emale + β2 educ + u


wage = β˜1 + δ2 male + β2 educ + u

These two models are equivalent:

wage = β˜1 + δ2 male + β2 educ + u


= β˜1 + δ2 (1 − f emale) + β2 educ + u
= (β˜1 + δ2 ) − δ2 f emale + β2 educ + u

So we have:
• β2 remains the same
• β˜1 ≠ β1 ; β1 = β˜1 + δ2 (i.e. the intercept in Model 1 is the intercept in Model
  2 plus δ2 )
• δ1 = −δ2


• The predicted regression lines for males and females remain the same

Notice the relationship between the STATA outputs of the two models (output omitted).

Just think about it case by case:

• What happens when male? f emale = 0, male = 1

• What happens when female? f emale = 1, male = 0

NOTE:

• The sum of the dummy variable coefficient and the intercept is always the
intercept of the other model

• Sum of sample means of dummy variables equals 1. This is because sample


means of dummy variables are sample proportions!


11.1.3 Categorical Variables with Multiple Groups


We can also have dummy variables that represent categorical data taking on
more than 2 values.

Example Relationship between the quantity of ice cream sold at a grocery store
in a week and the season of the year

We cannot have the model:


ice = β1 + β2 season + u (11.8)
where
      season = 1 (spring), 2 (summer), 3 (autumn), 4 (winter)                     (11.9)

as this assumes that the difference in ice cream sales between any two seasons
is fixed at k · β2 for some integer k ∈ {1, 2, 3}.

E.g. Ice creams sold in summer - Ice creams sold in spring = 1 · β2 .

Instead, we must have a different dummy variable for each season. So,
now let’s consider the model:
ice = β1 + δ1 dspring + δ2 dsummer + δ3 dautumn + δ4 dwinter + u (11.10)
The problem here is that we have perfect collinearity:
dspring + dsummer + dautumn + dwinter = 1 (11.11)
So, we must either:
• Exclude 1 of the 4 variables
• Keep all 4 variables, exclude the intercept
These options are all equivalent. I.e. The conclusions will be the same.

Example Let’s exclude the spring dummy variable

We call spring the baseline group or the reference group (the coefficient
of spring has essentially become β1 ). We have the model:
ice = β1 + δ2 dsummer + δ3 dautumn + δ4 dwinter + u                               (11.12)
where
β1 = E[ice|spring]
δ2 = E[ice|summer] − E[ice|spring]
δ3 = E[ice|autumn] − E[ice|spring]
δ4 = E[ice|winter] − E[ice|spring]


This model does not have a problem with collinearity, given that we have ob-
servations from each season in our dataset.

Note: We can calculate the difference in means across groups by finding the
difference in their coefficients

E.g.

δ4 − δ2 = (E[ice|winter] − E[ice|spring]) − (E[ice|summer] − E[ice|spring])
        = E[ice|winter] − E[ice|summer]

These coefficients are shifts in the intercepts of the regression lines for different
groups.

11.1.4 Hypothesis testing


To test whether season has an effect on ice, we just need to test for joint
significance:
H0 : δ 2 = δ 3 = δ 4 = 0 (11.13)
with the F-test.

11.1.5 Multiple Categorical Variables


Same procedure. Make sure to exclude one dummy variable for each
categorical variable. We have, in general, g − 1 dummy variables if we have
g groups.
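A minimal STATA sketch, assuming a variable season coded 1–4 with spring as the first level; factor-variable notation i.season creates the dummies and omits a baseline group automatically.

   regress ice i.season          // Stata omits the first level (spring) as the baseline group
   testparm i.season             // joint F-test of H0: all season coefficients equal 0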

11.2 Interaction terms


Interaction term: an additional variable created to allow the effect of one regres-
sor to depend on the value of another regressor

We generally want to explain Y using X2 , X3 , ..., Xn :

• Without interaction terms, E[Y |X2 , ..., Xn ] changes with Xi in the


same way when other regressors are fixed at different levels

• With interaction terms, E[Y |X2 , ..., Xn ] changes with Xi differently


when other regressors are fixed at different levels

Example Add new dummy variable nonwhite to the earning model to account
for race

wage = β1 + β2 educ + δf emale f emale + δnonwhite nonwhite + u (11.14)

The assumption in this model is that the difference in wages between males and
females is the same for whites and non-whites.


If this doesn’t seem like a reasonable assumption, we can add an interaction
term (a new variable):

fe_nw = female · nonwhite                                                         (11.15)

where
      fe_nw = 1 if female and nonwhite, 0 otherwise                               (11.16)
So, now we have the model

wages = β1 + β2 educ + δfemale female + δnonwhite nonwhite + δfe_nw fe_nw + u


(11.17)
There is no perfect collinearity as long as we have observations for each of
the combinations:

• (f, nw)

• (f, w)

• (m, nw)

• (m, w)

So, we get a separate predicted mean wage for each of the four groups (STATA output omitted).

NOTE:

• Include an interaction term between two dummy variables ⇒ different


intercepts in regression lines for each group

• Include an interaction term between a dummy variable and a continuous
  variable ⇒ different slopes in the regression lines for each group

Example Interaction term between gender and education

Let’s add the interaction term

educ_female = educ · female                                                       (11.18)

to get the model

wages = β1 + β2 educ + β3 educ_female + δfemale female + u                        (11.19)


Then we get the predicted wages:

M ale :β1 + β2 educ


F emale :β1 + β2 educ + β3 educ · 1 + δf emale · 1
= (β1 + δf emale ) + (β2 + β3 )educ

which have different slopes and intercepts.

Here, to test whether there is discrimination against females, it makes sense


to test the joint null hypothesis:

H0 : β3 = δf emale = 0 (11.20)

Different combinations of dummy and interaction terms shift either the intercept
or the slope (or both) of the group regression lines.
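A minimal STATA sketch of this model and the joint discrimination test, assuming variables wage, educ and female.

   gen educ_female = educ * female
   regress wage educ educ_female female
   test female educ_female           // joint F-test of H0: beta3 = delta_female = 0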


11.3 The alternative


Another way to get different predicted values for different groups is to run re-
gressions separately for each group.

E.g. We can estimate the model

wages = β1 + β2 educ + u (11.21)

using observations on females only (Model 1) and then observations on males


only (Model 2). We get the following 2 models:

The results will be equivalent to a model including a female dummy and a fe-
male/educ interaction term.

To test for discrimination against females, we need to check if the regression


coefficients are the same in Model 1 and Model 2.


11.4 Chow Test


Chow Test: Tests whether population regression functions are the same across
different groups

The Chow test is an F-test where we:

• Restricted model uses the full sample.

– This model imposes that the population regression coefficients are


the same across groups.

• Unrestricted model is the combination of regressions of each group.

– RSSU R = RSSU R,group1 + RSSU R,group2 + ... + RSSU R,groupn

• Specify q, n, k

• Compute the F-statistic

• Find critical value from the F-distribution

• Conclude

Example M/F wages and education


Our restricted model is

wages = β1 + β2 educ + u (11.22)

using the full sample.

We consider the RSS in this model: RSSR .

Our unrestricted model is

wages = β1 + β2 educ + u (11.23)

using data on females only to get RSSU R,f emale and then males only to get
RSSU R,male . Then, we use

RSSU R = RSSU R,f emale + RSSU R,male (11.24)

The alternative unrestricted model is

wages = β1 + β2 educ + β3 educ f emale + δf emale f emale + u (11.25)

using the full sample. We use the RSS of this model as RSSU R .

Now we consider our degrees of freedom:

• q = 2 as the null imposes that the 2 regression coefficients are the same
  across groups

• k = 4 as we have 2 coefficients × 2 groups


• n is the sample size in the full sample

Now, let’s carry out our Chow test:

RSSUR = RSSUR,female + RSSUR,male = 30288 + 55929 = 86217
RSSR = 92688
F-statistic = [(92688 − 86217)/2] / [86217/536] ≈ 20.12
Critical value: F (2, 536) at the 5% level ≈ 3.01 ⇒ reject H0 ; the regression
coefficients differ between males and females
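A minimal STATA sketch of this calculation, assuming variables wage, educ and a female dummy; e(rss) and e(N) store each regression's RSS and sample size.

   regress wage educ                         // restricted model: full sample
   scalar rss_r = e(rss)
   scalar n = e(N)
   regress wage educ if female == 1          // unrestricted model: females only
   scalar rss_f = e(rss)
   regress wage educ if female == 0          // unrestricted model: males only
   scalar rss_m = e(rss)
   scalar rss_ur = rss_f + rss_m
   display ((rss_r - rss_ur)/2) / (rss_ur/(n - 4))    // Chow F-statistic, q = 2, k = 4
   display invFtail(2, n - 4, 0.05)                   // 5% critical value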

NOTE: The Chow test can only be used where all coefficients are the
same across groups. Otherwise, we have to use the F-test.

i.e. We can only use the Chow test for the null that all of the coefficients
β1 , ..., βk are the same across groups.

Example We want to test whether β2 , β3 are the same for males/females while
allowing β1 to differ in the model

wages = β1 + β2 educ + β3 tenure + u (11.26)

Then, we will have to use the F-test where:

• Unrestricted model includes the dummy variable and the full set of
interaction terms (between dummy and other variables)

– UR model has intercept, education, tenure, female, female_educ, female_tenure

• Restricted model imposes all null hypothesis restrictions

– R model has only intercept, education, tenure, female

11.5 Binary Dependent Variables


Sometimes, our Y is binary:

• Actual observed values can only be 0 or 1

• Predicted values can be in (−∞, ∞)


We interpret the predicted values of Y as probabilities:

E[Y |X2 , ..., Xk ] = β1 + β2 X2 + ... + βk Xk
                    = 1 · Pr[Y = 1|X2 , ..., Xk ] + 0 · Pr[Y = 0|X2 , ..., Xk ]
                    = Pr[Y = 1|X2 , ..., Xk ]

Thus, regressing a binary dependent variable Y on X2 , ..., Xk is called the linear
probability model.
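A minimal STATA sketch of a linear probability model, assuming a hypothetical 0/1 dependent variable employed; robust standard errors are commonly used here because the LPM's error variance necessarily depends on the regressors.

   regress employed educ exper, robust    // coefficients are changes in Pr[Y = 1 | X]
   predict phat, xb                       // predicted probabilities (may fall outside [0, 1])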

Week 13: Model Misspecification

12.1 Data checks


12.1.1 Errors and missing values
Generally, do not exclude observations with missing values.

• The fact that values are missing can itself provide information

• Excluding these can lead to selection bias

12.1.2 Outliers
Outlier: an observation whose value is unusual in the context of the rest of the
data

Outliers:

• Usually heavily influence the regression outcome

• Should not be simply dropped from the data

Importance of outlier screening:

• Avoid erroneous data

• Large impact on OLS estimations

Take care if your results change significantly when a few key observations (’in-
fluential observations’) are dropped. Results with these observations dropped
may be included as a robustness check.

Checking for outliers


Univariate case Box/Whisker plot


Bivariate case Check the plot of Y on X

Multivariate case Common practice is to plot residuals against each regres-


sor. If there are too many regressors, plot the residuals against the fitted values
Ŷi

2 ways to quantify the effect of outliers:

• DFITS: change in fitted values

• DFBETA: change in coefficient estimates
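A minimal STATA sketch of both diagnostics after a regression (hypothetical variable names); predict with the dfits option and the dfbeta command are available after regress.

   regress y x2 x3
   predict dfits_i, dfits     // DFITS: influence of each observation on its own fitted value
   dfbeta                     // DFBETA: creates one _dfbeta_* variable per regressor
   rvfplot                    // residuals vs fitted values, a quick visual check for outliers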

12.2 Model Checks


Ways in which assumptions can be wrong:

• Wrong model chosen

– Omitted Variable Bias


– Irrelevant variables included
– Non-linear relationship

• Errors correlated with:

– Regressors (Endogeneity)


– Each other (Autocorrelation)

• Errors have non-constant variance (Heteroskedasticity)

12.2.1 Wrong model


Omitted Variable Bias
Suppose we use the model

Y = β 1 + β2 X2 + u (12.1)

despite the true model being

Y = β 1 + β2 X2 + β 3 X3 + u (12.2)

If all assumptions hold, regressing Y on X2 and X3 should give E[b2 ] = β2 .

But, we regress Y on X2 only. So X3 is the omitted variable. We get:

Ŷ = b˘1 + b˘2 X2 (12.3)

In general, E[b˘2 ] ≠ β2 . If X2 is correlated with X3 , b˘2 picks up the effects of


both regressors. So, if we have

X3 = γ 1 + γ 2 X 2

then we get

E[b˘2 ] = β2 + β3 γ2

Don’t worry about how this is derived.

NOTE: The error term of the short (misspecified) model also absorbs the omitted
variable:

u˘ = u + β3 X3

So, we see that E[b˘2 ] is equal to β2 plus:

• γ2 : The relationship between the omitted variable and the regressor X2

• β3 : The relationship between the omitted variable and the dependent


variable Y

So, as long as both γ2 ≠ 0 and β3 ≠ 0, we have E[b˘2 ] ≠ β2 . b˘2 is a biased estimator
for β2 , the true population coefficient for X2 . This is omitted variable bias.

This formula allows us to determine the sign or direction of the bias. We


have
E[b˘2 ] = β2 + γ2 β3 (12.4)
Upward bias: E[b˘2 ] > β2 when

• γ 2 , β3 > 0


• γ 2 , β3 < 0
Downward bias: E[b˘2 ] < β2 when
• γ2 > 0, β3 < 0
• γ2 < 0, β3 > 0
No bias: E[b˘2 ] = β2 when
• γ2 = 0
• β3 = 0
Dealing with OVB:
• Simplest: Include the omitted variable in our regression
• Advanced techniques: instrumental variables, natural experiments, etc.
• Last resort: determine the sign of the bias

Example Determining the sign of the bias - Returns to Education

It is difficult to estimate returns to education due to personal ability being


an omitted variable:
• Higher ability people are more likely to go to school γ2 > 0
• Higher ability people are more likely to have higher-earning careers β3 > 0
We have γ2 , β3 > 0, so the bias is upwards.

Irrelevant variables
Implications of including irrelevant variables, or too many variables:
• OLS estimators remain unbiased and consistent
• OLS estimators are less precise
– Multicollinearity may be a problem
– Multicollinearity → se increases → t-test less accurate when n is
small

12.2.2 Endogeneity
In the regression model

Y = β1 + β2 X2 + ... + βk Xk + u (12.5)

if u is correlated with a regressor Xj , this regressor is endogenous.

Common sources of endogeneity:


• OVB (as ui = u + βk Xk when Xk omitted)


• Simultaneous change in variables (e.g. Y increase ⇒ X increase and vice


versa)

– Simultaneity bias: Y affects X, X affects Y, so both effects are insep-


arably combined into the coefficient of X. This means the coefficient
estimate is biased.

• Sample selection (e.g. non-representative sampling on the dependent vari-


able)

Implications of endogeneity: OLS estimators become

• Biased

• Inconsistent

12.2.3 Functional form misspecification


Functional form misspecification: Model does not properly take into ac-
count relationship between dependent and independent variables. Occurs when
a key variable, which is a function of other variables, has been omitted.
E.g. non-linear relationship

Implications:

• se increases

• Estimates are biased/inconsistent

Example 1
True population model:

Y = β1 + β2 educ + β3 age + β4 age2 + u

Our model:

Y = β1 + β2 educ + β3 age + u

Implication: the estimates of β1 , β2 , β3 all become biased

Example 2
True population model:

ln(income) = β1 + β2 educ + β3 age + β4 f emale + β5 f emale × age + u

Our model:

ln(income) = β1 + β2 educ + β3 age + β4 f emale + u

Implication: the estimates of β1 , β2 , β3 , β4 all become biased


Plots for detecting functional misspecification


To examine the overall fit of a model, look at the scatter plot of:

• Actual values of Y against X vs the fitted line

• Residuals vs predicted values

To examine the functional form of a single regressor, look at the scatter plot
of:

• Residuals vs the regressor

Test for detecting functional misspecification


Regression Specification Error Test (RESET Test): a test for neglected
non-linear terms

• A special type of F-test

Limits of the RESET Test:

• n increases ⇒ estimator precision increases ⇒ chance of rejecting H0
  increases (in fact, H0 is rejected at any significance level as n → ∞)

• Cannot tell extent of model misspecification

• Doesn’t indicate how to correct misspecification

Consider the model

Y = β1 + β2 X2 + ... + βk Xk + u (R)

If this model is correct, then:

• Population assumptions are satisfied

• Including any particular choices of higher order terms or interac-


tion terms should not have any significant effect on the model


The RESET test adds polynomials of the OLS fitted values Ŷ , usually Ŷ ² and Ŷ ³.

• Ŷ ² and Ŷ ³ are 2 particular combinations of higher order terms and inter-
  action terms of the regressors

• If the model is correct, adding Ŷ ² and Ŷ ³ should not have any significant
  effect

So, we get the model

      Y = β1 + β2 X2 + ... + βk Xk + γ1 Ŷ ² + γ2 Ŷ ³ + u               (UR)

RESET Test:

1. Estimate restricted model (R) to get the predicted values Ŷ

2. Estimate unrestricted model (UR) with Yˆ2 and Yˆ3 as additional regressors

3. Perform F-test

• H0 : γ1 = γ2 = 0 and Ha : H0 false
• In this case, F-statistic is F (2, n − (k + 2)) distributed
• If we reject, (R) is likely to be misspecified
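A minimal STATA sketch of the RESET test by hand (hypothetical regressor names); Stata's built-in estat ovtest runs a version of this automatically after regress, using powers of the fitted values.

   regress y x2 x3
   predict yhat, xb
   gen yhat2 = yhat^2
   gen yhat3 = yhat^3
   regress y x2 x3 yhat2 yhat3
   test yhat2 yhat3             // F-test of H0: gamma1 = gamma2 = 0

   * or, equivalently in spirit:
   regress y x2 x3
   estat ovtest                 // Ramsey RESET test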

Example Comparing 2 Housing Price Models

We wish to compare the following 2 models:

price = β1 + β2 lotsize + β3 sqrft + β4 bdrms + u                       (1)

ln(price) = β1 + β2 ln(lotsize) + β3 ln(sqrft) + β4 bdrms + u           (2)

Model 1:

• UR: price = β1 + β2 lotsize + β3 sqrft + β4 bdrms + γ1 (fitted price)² +
  γ2 (fitted price)³ + u

• n = 88, q = 2, k = 4, F2,88−4−2

• F = 4.67

• Reject ⇒ likely misspecified

Model 2:

• UR: ln(price) = β1 + β2 ln(lotsize) + β3 ln(sqrft) + β4 bdrms +
  γ1 (fitted ln(price))² + γ2 (fitted ln(price))³ + u

• n = 88, q = 2, k = 4, F2,88−4−2

• F = 2.56

• Do not reject ⇒ likely correctly specified

So, we prefer the log-log specification.


12.2.4 Heteroskedasticity
Heteroskedasticity: Errors for different observations have different variances.

Check scatter plots of:


• Y against X
• Residuals against X
(Scatter plots illustrating homoskedasticity vs heteroskedasticity omitted.)

Implications of heteroskedasticity:
• Heteroskedasticity does NOT:
– Cause bias or inconsistency
– Affect estimated coefficients
– Affect goodness of fit (R², R̄²)
• Heteroskedasticity does:
– Make the usual variance formulas/estimators invalid ⇒ standard errors of
  the estimators are invalid ⇒ inferences are invalid
Fix:
regress y x, robust
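A minimal STATA sketch of the check-then-fix workflow (hypothetical variable names); rvfplot and estat hettest are available after regress.

   regress y x
   rvfplot                    // residuals vs fitted values: a fanning-out pattern suggests heteroskedasticity
   estat hettest              // Breusch-Pagan test for heteroskedasticity
   regress y x, robust        // heteroskedasticity-robust standard errors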


12.2.5 Autocorrelation
Implications:

• Estimators remain unbiased and consistent


• se calculations and inferences become much less accurate
2 common occurrences:

• Time series data


– Fix: HAC standard errors
• Clusters: errors correlated for observations in the same cluster, uncorre-
lated for observations in different clusters

– Fix: cluster-robust standard errors
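A minimal STATA sketch of the two fixes, assuming a hypothetical time variable year for the time-series case and a hypothetical cluster identifier firmid for the clustered case.

   * time series: HAC (Newey-West) standard errors
   tsset year
   newey y x, lag(4)

   * clusters: cluster-robust standard errors
   regress y x, vce(cluster firmid)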
