
QBUS2810

Statistical Modeling for Business


Semester 1, 2017


NB: When a new response lies outside the estimation sample of a regression, it is
independent of the sample errors.

Simple Linear Regression



Chapter 1 Summary
• Causal inferences are most certain – if not completely definitive – in randomized
experiments, but observational data can also be reasonably marshaled as evidence
of causation.
• Good experimental practice seeks to avoid confounding experimentally manipulated
explanatory variables with other variables that can influence the response variable.
• In analyzing observational data, it is important to distinguish between a variable that
is a common prior cause of an explanatory variable and response variable and a
variable that intervenes causally between the two.
• It is overly restrictive to limit the notion of statistical causation to explanatory
variables that are manipulated experimentally, to explanatory variables that can be
manipulated in principle, or to data that are collected over time.
• Randomization and good sampling design are desirable in social research, but they
are not prerequisites for drawing statistical inferences. Even when randomization or
random sampling is employed, we typically want to generalize beyond the strict
bounds of statistical inference.


Chapter 2 Summary

• In very large samples, and when the explanatory variables are discrete, it is possible
to estimate a regression by directly examining the conditional distribution of Y given
the Xs. When the explanatory variables are continuous, we can proceed similarly by
dissecting the Xs into a large number of narrow bins.
    o Sampling error (variance): sampling error is minimized by using a small
      number of wide bins, each with many observations.
    o Bias: bias is minimized by using a large number of narrow bins.
• In smaller samples, local averages of Y can be calculated in a neighborhood or
window surrounding each x-value. There is a tradeoff in local averaging between the
bias and variance of estimates: narrow windows reduce bias but, because they
include fewer observations, increase variance.
• Lowess (locally weighted regression) produces smoother results than local
averaging, reduces boundary bias, and can discount outliers. The degree of
smoothness is controlled by the span of the lowess smoother: larger spans yield
smoother lowess regressions.
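
As a minimal illustration of the span/smoothness tradeoff, the sketch below fits a lowess curve with two different spans using the lowess smoother in statsmodels. The simulated data and the frac values are assumptions chosen purely for illustration.

```python
# Minimal sketch of lowess smoothing with two different spans (assumed example data).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)   # nonlinear relationship plus noise

# frac is the span: the proportion of the data used for each local fit.
smooth_wide = lowess(y, x, frac=0.6)    # larger span -> smoother curve (more bias, less variance)
smooth_narrow = lowess(y, x, frac=0.1)  # smaller span -> wigglier curve (less bias, more variance)

# Each result is an array with columns [sorted x, smoothed y].
print(smooth_wide[:5])
```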


Statistical Relationship

Data analysts and statisticians in business are interested in measuring and predicting the
quantitative effect of events and interventions.

Definition

• Two random variables Y and X have a relationship when they are not independent,
that is:
P(Y, X) \neq P(Y)\,P(X)

• I.e. two variables have a relationship if changes in one variable are associated with
changes in the other variable.


Using data to measure statistical relationships
Businesses, financial markets, market research firms etc. collect data to consider questions
such as:
• How can I protect my customers from credit card fraud?
• Effects of targeted ads on purchase rates
• Effect of a drug on patients

Observational v Experimental data
• Experimental data: the values of an explanatory variable of interest are assigned by
a random mechanism, independently of other factors (randomized experiments).
• Observational data: the values of the explanatory variables and responses are
observed by the researcher without intervention.
    o Typically this is the only data available.
    o QBUS2810 will focus on this type of data.
    o "Correlation does not imply causation."
    o In interpreting observational data, certain relevant variables may be omitted
      (confounding).
        § What makes an omitted variable relevant?
            1. The omitted variable must influence the response.
            2. The omitted variable must be related to the explanatory variable.
    o Observational data allows for prediction (if the sample is representative of
      the population).
    o Cross-sectional data: collected at a single point in time.
    o Longitudinal data: collected over time.

From relationship to predictability
Two conditions are needed:
1. The two variables must co-vary, or, have a relationship, when one changes the other
must also change (in-sample, training sample)
2. We need to be able to predict outcomes accurately on unseen data (out-of-sample, test sample).
Accurate predictions can have high commercial value


From relationship to causal statements
Three conditions need to be met to justify a statement that a measured effect of X on Y is a
causal effect:
1. The two variables must co-vary

2. The independent variable X must precede the dependent variable Y (in time)
3. No other factor could possibly have caused the measured change in Y

Because the circumstances of observational studies are usually not controlled by the
researcher, they generally do not allow for isolating the effect of X on Y among possible
omitted variables.

Theoretical relationships
Theory or domain expertise suggests important relationships amongst variables:
• Asset return = α + β × market return
• Company return = f(earnings)
• Sales = f(advertising)
• Bounce rate = f(website layout, content)
• Probability of purchase = f(gender, age, income, targeted marketing, past purchases, …)

From theoretical to empirical relationships

Theory or expert knowledge:
• Often neglects functional forms.
• Rarely suggests quantitative magnitudes.
• May require empirical testing and/or verification.

These are required to make effective business decisions.

Regression Analysis


A regression is a statistical model for the probability distribution of a response Y as a
function of one or more predictors X_1, X_2, …, X_p:

P(Y \mid x_1, \ldots, x_p) = f(x_1, \ldots, x_p)

A regression:
• Predicts the value of a response variable based on the value of at least one
independent variable.
• Measures the effects of the explanatory variable(s) on the response variable.
• Measures the statistical uncertainty in these estimates.


Common assumptions
1. The conditional distributions P(Y|x) are all normal distributions with the same
variance, and the conditional means of Y are all on a straight line.


• Skewness: if the conditional distribution of Y is skewed (as at x₁), then the mean will
not be a good summary of its center.
• Multiple modes: if the conditional distribution of Y is multimodal (as at x₂), then it is
intrinsically unreasonable to summarize its center by a single number.
• Heavy tails: if the conditional distribution of Y is non-normal – for example heavy-tailed –
(as at x₃), then the sample mean will not be an efficient estimate of the
center of the Y-distribution, even if it is symmetrical.
• Unequal spread: if the conditional variance of Y changes with the values of the Xs
(compare x₄ and x₅), then the efficiency of the usual least-squares estimates may be
compromised.

Simple Linear Regression (SLR)
Let (Y_i, X_i) for i = 1, …, n be the sample of interest. The SLR model is:

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

response = intercept + slope × predictor + error

X_i is the independent variable, regressor, predictor, or explanatory variable
Y_i is the dependent variable or response
β₀ is the population intercept
β₁ is the population slope
ε_i is the regression error (omitted factors, measurement error)














Least Squares Estimation

Question: How can we estimate β₀ and β₁ from the data?

The sample average Ȳ is the least squares estimator of μ_Y, that is, Ȳ is the solution to:

\min_{m} \sum_{i=1}^{n} (Y_i - m)^2

The OLS estimator of β₀ and β₁ is:

(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2

Other algorithms
1. Least absolute deviations (minimize the sum of the absolute values of the errors):

\min_{b_0, b_1} \sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|

2. Local smoothing: use the average value of Y for the data "close" to it, for each X
value (this will not give a straight line).

3. Minimize the sum of the absolute errors to the pth power:

\min_{b_0, b_1} \sum_{i=1}^{n} |Y_i - b_0 - b_1 X_i|^p

Analytical solution

\min_{b_0, b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2

We can obtain the exact solutions to this problem using calculus:

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{S_{XY}}{S_X^2}

where:

S_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} is the sample covariance between Y and X; and

S_X^2 = \frac{\sum_i (X_i - \bar{X})^2}{n - 1} is the sample variance of X


Notation and terminology

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i : fitted value
e_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i = Y_i - \hat{Y}_i : residual
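
The closed-form estimates above are easy to compute directly. The following sketch uses simulated data as a stand-in for a real sample and implements the formulas for β̂₀, β̂₁, the fitted values, and the residuals.

```python
# Minimal sketch: closed-form OLS estimates for SLR (simulated data).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5, 2, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)   # true beta0 = 2, beta1 = 0.5

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

fitted = beta0_hat + beta1_hat * x
residuals = y - fitted

print(beta0_hat, beta1_hat)
print(residuals.mean())   # approximately 0 (up to floating point), consistent with the mean residual being 0
```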


Measuring fit

Question: How well does the regression fit the data?

• The standard error of the regression (SER) measures the standard deviation of the
regression errors.
• The regression R² measures the fraction of the variance in Y "explained" by the
variation in X and the linear model.
    o R² ranges from 0 to 1.

Standard error of the regression
SER is in the same units as Y:

\mathrm{SER} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (e_i - \bar{e})^2} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} e_i^2}

NB: the fact that \bar{e} = 0 follows from the first-order condition for \hat{\beta}_0.

Question: Why divide by n – 2 instead of n – 1?

• The division by n – 2 is a degrees-of-freedom correction.
    o This is just like the division by n – 1 in S_Y^2, but here we have estimated two
      parameters.
• The squared SER is an unbiased estimator of the error variance only if we use n – 2 (proof later).
• The difference is of course negligible when n is large.


Analysis of variance decomposition

Consider the following rearrangement:

Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i) = (\hat{Y}_i - \bar{Y}) + e_i

Thus:

\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2

TSS = RegSS + RSS

TSS = \sum_i (Y_i - \bar{Y})^2 : total sum of squares
RegSS = \sum_i (\hat{Y}_i - \bar{Y})^2 : regression sum of squares
RSS = \sum_i e_i^2 : residual sum of squares

Variance

The general equation of variance is:

\mathrm{Var}(x) = E(x^2) - [E(x)]^2

The variance of an OLS estimator is:

\mathrm{Var}(\hat{\theta}) = E\left[ \left( \hat{\theta} - E(\hat{\theta}) \right)^2 \right]


Coefficient of determination

R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}

or

R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}

• R² measures the proportion of the variation in the response Y that is accounted for
by the estimated linear regression line (with X).
    o NB: it is not a measure of how good the model is, or similarly a measure of
      how well the model predicts the response.

Relationship between R² and the sample correlation in SLR

r = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2} \sqrt{\sum_i (Y_i - \bar{Y})^2}}

In SLR, R² corresponds to the square of the sample correlation coefficient between the response
and the predictor.
• NB: the correlation is symmetric in X and Y, and thus it does not matter which variable is the response or the
predictor.
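
A quick numerical check of this identity, again with simulated data: compute R² from the sums of squares and compare it with the squared sample correlation.

```python
# Minimal sketch: verify that R^2 equals the squared sample correlation in SLR (simulated data).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 200)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

rss = np.sum(e ** 2)                  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)   # the two numbers agree
```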


Application: Capital Asset Pricing Model (CAPM)

• CAPM was proposed by William Sharpe in 1970
• It implies an SLR model

Main idea
There are two types of risk:
• Systematic (non-diversifiable)
• Idiosyncratic (asset specific)
The expected return of an asset should only depend on the sensitivity of the asset to the
market portfolio (which represents total diversification)

Notation

R_t : asset return at time t
R_{f,t} : risk-free rate of return at time t
R_{m,t} : market return at time t
Y_t = R_t − R_{f,t} : asset return premium at time t
X_t = R_{m,t} − R_{f,t} : market return premium at time t


Regression
The population regression line:

E(R_t) = R_{f,t} + \beta \left[ E(R_{m,t}) - R_{f,t} \right]

The empirical SLR model:

R_t - R_{f,t} = \alpha + \beta (R_{m,t} - R_{f,t}) + \epsilon_t

NB: α = 0 according to the theory (but not necessarily empirically).


Financial Returns


Let P_t and P_{t-1} be the price of an asset at times t and t − 1 respectively. The return of the
asset in this period is:

R_t = \frac{P_t - P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}} - 1

• NB: multiply by 100 to obtain percentage returns.
• NB: it is standard to base returns on closing prices, adjusted for dividend payments.

Financial analyses often rely on log returns instead, which are mathematically more
convenient:

R_t = \log P_t - \log P_{t-1}

• where log is the natural logarithm.
• NB: it is a property of natural logarithms that \log(1 + r) \approx r for small values of r.
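
A short sketch of computing simple and log returns from a price series with pandas; the prices here are made up for illustration.

```python
# Minimal sketch: simple and log returns from an (assumed) adjusted closing price series.
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.5, 100.8, 102.3, 103.0])

simple_returns = prices.pct_change()   # (P_t - P_{t-1}) / P_{t-1}
log_returns = np.log(prices).diff()    # log(P_t) - log(P_{t-1})

# For small returns the two are close, since log(1 + r) ≈ r.
print(pd.DataFrame({"simple": simple_returns, "log": log_returns}))
```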

Questions

The key prediction of the CAPM is that differences in expected returns should be related to
beta: the larger the beta, the higher the expected return. In practice this prediction does not
always hold. Questions to consider:

• Is the model a strong or weak fit to the data?

o NB looking at R2 is incorrect
• Is the market beta significantly different from 1?
• Is Jensen’s alpha significantly different from 0?

Statistical Inference for Simple Linear Regression




Algorithms and statistical inference

Efron and Hastie (2016) distinguished between two areas of statistical development:
• Algorithmic developments aimed at specific problem areas (such as prediction).
• Inferential arguments in support of these methods, and for uncertainty
quantification.
    o NB: algorithms are what statisticians do, while inference says why they do
      them and what the associated uncertainty is.


Statistical inference for SLR






Question: How accurate is the OLS algorithm for SLR?

• Under what conditions does OLS give appropriate estimates of β₀ and β₁?
• Is it unbiased, consistent and/or efficient?
• What is the sampling distribution of the estimator?

To answer these questions some assumptions have to be established

Review of basic concepts

Let θ̂ be an estimator for a fixed but unknown population parameter θ.

• Unbiasedness: E(θ̂) = θ, where the expectation is over random samples.
• Consistency: θ̂ converges to θ in probability as the sample size gets larger.
    o Informally, the estimator gets it right with an arbitrarily large sample size.
• Efficiency: an efficient estimator has variance equal to or lower than the variance of
all other possible estimators.



SLR model assumptions

1. Linearity: if X = x then Y = β₀ + β₁x + ε for some population parameters β₀ and β₁
and a random error ε.
2. Exogeneity: the conditional mean of ε given X is zero, that is E(ε | X) = 0. Hence
E(Y | X = x) = β₀ + β₁x.
3. Constant error variance: Var(ε | X = x) = σ².
4. Independence: all the error pairs ε_i and ε_j (i ≠ j) are independent.
5. The distribution of X is arbitrary (X can even be non-random).

NB: these assumptions are all non-trivial and will often be violated (perhaps all at the same
time).


Assumptions 1 and 2
Assumptions 1 and 2 mean that the expectation of Y given X is a straight line:

E(Y \mid X) = E(\beta_0 + \beta_1 X + \varepsilon \mid X) = \beta_0 + \beta_1 X + E(\varepsilon \mid X) = \beta_0 + \beta_1 X

NB: they imply that all other factors (i) have an average effect of zero on Y given X and (ii) are
uncorrelated with X.

Assumption 2
• In experimental data, E(ε | X) = 0 holds by design.
• In observational data, E(ε | X) = 0 will generally not hold.

Assumption 3
• The assumption that Var(ε | X = x) = σ² plays a central role in the derivation of the
sampling distribution of the OLS estimator.
• The constant error variance case is also known as homoscedasticity (especially in
econometrics).
    o Heteroscedasticity refers to the case where the variance is non-constant.
        § It is possible to correct the OLS standard errors for non-constant
          variance (heteroscedasticity-robust standard errors).
§ We can still use OLS when the assumption is not satisfied. However,
more efficient (lower variance) estimators are available.

• Data transformation is often helpful for obtaining a specification that
(approximately) satisfies this assumption.





Assumption 4
This arises automatically if the data are collected by simple random sampling. The
assumption implies that the errors are uncorrelated:

\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X_i, X_j) = 0, \quad i \neq j

• We may encounter non-i.i.d. (independent and identically distributed) sampling
where there is non-random sampling.
    o Example: convenience sampling.
• Time series and spatial data will usually violate the assumption that the
observations are independent.


Beyond the assumptions: practical requirements

• Even if these strong assumptions are satisfied, this does not mean that we're good in
practice.
• The OLS estimator can be very sensitive to the structure of the data and may be
heavily influenced by one or a few influential observations.
• Problems also arise for inference when the distribution of the errors is very skewed
(which can occur, for example, in financial data).
• In those cases, applying large sample results is questionable.
    o Our applications will only be reliable in typical sample sizes if large outliers are
      "unlikely".

Outliers
• Outliers are data points that are very far from the rest of the data. It is important to
develop intuition and understand when they can cause problems.
• Data errors are often a source of outliers. Always check and clean your data first


Checking the assumptions

In order to check assumptions, use residual diagnostics.

Sampling properties of the OLS estimator

Establishing the sampling properties of β̂₀ and β̂₁ will allow us to:
• Verify unbiasedness and consistency.
• Quantify the sampling uncertainty associated with β̂₀ and β̂₁.
• Use β̂₀ and β̂₁ to test hypotheses (such as β₁ = 0).
• Construct confidence intervals for β₁ and sometimes β₀.

NB: assume that Assumptions 1–5 hold throughout (except when noted).



Preliminaries

Recall the following probability rules:

E(aX + bY) = aE(X) + bE(Y)

\mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)

NB: these are very important and need to be memorized.

Suppose for example that ε_i and ε_j (i ≠ j) are random errors from the SLR model. Then:

E(a\varepsilon_i + b\varepsilon_j) = aE(\varepsilon_i) + bE(\varepsilon_j) = 0

\mathrm{Var}(a\varepsilon_i + b\varepsilon_j) = a^2 \mathrm{Var}(\varepsilon_i) + b^2 \mathrm{Var}(\varepsilon_j) = \sigma^2 (a^2 + b^2)

Key technique: if we can write an estimator (say, the OLS estimator) as a constant plus a
sum of independent error terms (a "constant plus noise" representation), then
we can easily calculate its expected value and variance using these rules.

Let X and Y be two arbitrary random variables. The law of iterated expectations states that:

E(Y) = E\left[ E(Y \mid X) \right]

For our model, Assumption 2 and the law of iterated expectations imply, for example:

E(\varepsilon_i) = E\left[ E(\varepsilon_i \mid X_i) \right] = E(0) = 0

E(X_i \varepsilon_i) = E\left[ E(X_i \varepsilon_i \mid X_i) \right] = E\left[ X_i E(\varepsilon_i \mid X_i) \right] = 0





Constant plus noise representation

We need to establish some notation and auxiliary results. Define:

c_i = \frac{X_i - \bar{X}}{\sum_j (X_j - \bar{X})^2}

Then:

\sum_i (X_i - \bar{X}) = 0 \;\Rightarrow\; \sum_i c_i = 0

\sum_i (X_i - \bar{X}) X_i = \sum_i (X_i - \bar{X})^2 \;\Rightarrow\; \sum_i c_i X_i = 1

From the formula for β̂₁,

\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \sum_i c_i (Y_i - \bar{Y})

Notice that

\hat{\beta}_1 = \sum_i c_i (Y_i - \bar{Y}) = \sum_i c_i Y_i - \bar{Y} \sum_i c_i = \sum_i c_i Y_i

Interpretation: β̂₁ is a linear combination of the observations Y_i. That's the "linear" in linear
estimator.

\hat{\beta}_1 = \sum_i c_i Y_i = \sum_i c_i (\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 \sum_i c_i + \beta_1 \sum_i c_i X_i + \sum_i c_i \varepsilon_i = \beta_1 + \sum_i c_i \varepsilon_i

Therefore:

\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i

Interpretation: β̂₁ is the true parameter plus a linear combination of the errors (if the model
is correct). That's the constant plus noise representation.


Expected value and unbiasedness

From

\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i

it follows that:

E(\hat{\beta}_1) = E\left( \beta_1 + \sum_i c_i \varepsilon_i \right) = \beta_1 + \sum_i E(c_i \varepsilon_i)
= \beta_1 + \sum_i E\left[ E(c_i \varepsilon_i \mid X_1, \ldots, X_n) \right] = \beta_1 + \sum_i E\left[ c_i E(\varepsilon_i \mid X_i) \right] = \beta_1

using the linearity of expectations, the law of iterated expectations, and Assumption 2.


Variance

\mathrm{Var}(\hat{\beta}_1) = \mathrm{Var}\left( \beta_1 + \sum_i c_i \varepsilon_i \right) = \mathrm{Var}\left( \sum_i c_i \varepsilon_i \right)
= \sum_i c_i^2 \mathrm{Var}(\varepsilon_i)   (using Assumption 4)
= \sigma^2 \sum_i c_i^2   (using Assumption 3)
= \sigma^2 \frac{\sum_i (X_i - \bar{X})^2}{\left[ \sum_i (X_i - \bar{X})^2 \right]^2}
= \frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}
= \frac{\sigma^2}{(n-1) s_X^2}

Therefore:

\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{(n-1) s_X^2}

Interpretation:
• The larger the noise around the population regression line, the larger the variance of
β̂₁.
• The variance decreases with the sample size n, and is inversely proportional to it.
• The larger the variation in the regressor, the lower the variance of β̂₁. More variation
in the regressor helps to estimate the model.

• Under the standard assumptions, OLS has the lowest variance among all unbiased
estimators that are linear in Y, a result known as the Gauss-Markov theorem.
• That is, the OLS estimator is the best linear unbiased estimator (BLUE) if the
assumptions hold.
• The OLS estimator loses this property if the errors have non-constant variance.

Consistency

E(\hat{\beta}_1) = \beta_1

\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{(n-1) s_X^2}

• The variance of the estimator goes to zero as n → ∞.
• Together with unbiasedness, this means the OLS estimator approaches β₁ in
probability as n gets larger.
• The estimator is therefore consistent.


Large sample distribution using the CLT

SLR Model

Starting from the constant plus noise representation:

\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i

• It follows that the finite sample distribution of β̂₁ depends on the population distribution
of the errors ε_i, which we did not specify.
    o NB: all we know is the mean and variance of the sampling distribution.
• However, we can use the Central Limit Theorem (CLT) to approximate the sampling
distribution of β̂₁ when n is "sufficiently large".


General Setting

Let X₁, …, X_n be a sequence of independent and identically distributed (i.i.d.) random
variables such that E(X_i) = μ and Var(X_i) = σ² for every i. Furthermore, let X̄ be the
sample average estimator. The CLT states that as n → ∞,

\bar{X} \xrightarrow{d} N\!\left( \mu, \frac{\sigma^2}{n} \right) \quad \text{or} \quad \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)

• Hence, for a sufficiently large n we can use this result as an approximation to say
\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \approx N(0, 1).
• Limitation: n may have to be huge for this approximation to be reasonable when the
distribution of X is very skewed and/or subject to large outliers.



SLR Model

\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i

means that β̂₁ is essentially a weighted average of the errors (recall that the variance of
β̂₁ is proportional to 1/n). Hence a CLT applies to it.

For n sufficiently large,

\hat{\beta}_1 \approx N\!\left( \beta_1, \mathrm{Var}(\hat{\beta}_1) \right)

\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\mathrm{Var}(\hat{\beta}_1)}} \approx N(0, 1)

Summary

If the SLR model assumptions hold, then:
• β̂₁ is unbiased.
• The variance of β̂₁ is proportional to 1/n.
• β̂₁ is consistent.
• The finite sample distribution of β̂₁ is complicated, unknown, and depends on the
distribution of X and ε.
• When n is sufficiently large, \hat{\beta}_1 \approx N\!\left( \beta_1, \mathrm{Var}(\hat{\beta}_1) \right).
p
wa


ks

Statistical Inference for Simple Linear Regression


in
Th


Standard error

We showed that:

\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (X_i - \bar{X})^2}

• In practice we do not know σ².
    o We need to estimate it.
• Recall that the standard error of the regression (SER) estimates the standard
deviation of the errors:

\mathrm{SER} = \sqrt{\frac{\sum_i e_i^2}{n - 2}}

The standard error of β̂₁ is therefore:

\mathrm{SE}(\hat{\beta}_1) = \frac{\mathrm{SER}}{\sqrt{\sum_i (X_i - \bar{X})^2}}


Robust standard error
• Assumption 3 of constant error variance was crucial for deriving Var(β̂₁), but it is
often unrealistic in practice.
• Heteroscedasticity-robust standard errors allow us to account for this possibility
without the added layer of complexity of modelling the conditional variance
Var(ε_i | X_i = x_i) or considering another estimator for β₁.

If we cannot assume constant variance, then:

\mathrm{Var}(\hat{\beta}_1) = \sum_i c_i^2 \mathrm{Var}(\varepsilon_i \mid X_i) = \sum_i c_i^2 \sigma_i^2

We can estimate each c_i^2 \sigma_i^2 by c_i^2 e_i^2 to obtain the heteroscedasticity-robust standard error:

\mathrm{SE}(\hat{\beta}_1) = \sqrt{\frac{\sum_i (X_i - \bar{X})^2 e_i^2}{(n-1)^2 s_X^4}}

Robust vs non-robust standard errors
• Robust standard errors are more conservative (i.e. larger) compared to the regular
standard errors.
• The main advantage of assuming constant variance is that the resulting tests are
more efficient (higher power), if the assumption of constant variance is correct.
• However, if the errors have non-constant variance, computing SEs under Assumption
3 can lead to anti-conservative and wrong results.
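
In statsmodels, heteroscedasticity-robust standard errors can be requested through the cov_type argument when fitting. The sketch below compares regular and robust standard errors on simulated heteroscedastic data; the choice of "HC1" is one common option, not the only one.

```python
# Minimal sketch: regular vs heteroscedasticity-robust standard errors (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 500)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5 * x)   # error variance grows with x

X = sm.add_constant(x)
fit_standard = sm.OLS(y, X).fit()               # assumes constant error variance
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroscedasticity-robust (HC1) standard errors

print(fit_standard.bse)   # regular standard errors
print(fit_robust.bse)     # robust standard errors, typically larger here
```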



Sampling distribution of the standardized estimator

For "sufficiently" large n we can adopt the CLT approximation:

\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\mathrm{Var}(\hat{\beta}_1)}} \approx N(0, 1)

Since SE(β̂₁) estimates \sqrt{\mathrm{Var}(\hat{\beta}_1)} consistently, we use the same approximation when
estimating the variance of the errors. When using the standard error (standard or robust), we adopt the large sample
approximation:

\frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \approx N(0, 1)

We may also refer to the following approximation:

\frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \approx t_{n-2}

• t_{n-2} denotes the t distribution with n − 2 degrees of freedom.
• The t distribution is the exact distribution when the errors are Gaussian with
constant variance (see Module 4).
• The t distribution becomes identical to the normal distribution as the degrees of
freedom increase, so the distinction is immaterial for n large enough for an
acceptable CLT approximation.

Hypothesis testing
Objective: test a hypothesis about β₁ using observational data and reach a tentative
conclusion about whether a linear relationship exists, or whether it is positive or negative.

Two sided:
H₀: β₁ = 0
H₁: β₁ ≠ 0
One sided:
H₀: β₁ = 0
H₁: β₁ > 0

Objective: test a hypothesis about the magnitude of β₁.

Two sided:
H₀: β₁ = b
H₁: β₁ ≠ b
One sided:
H₀: β₁ = b
H₁: β₁ > b

Systematic steps
1. Formulate a question in terms of null and alternative hypothesis.
2. Specify assumptions and significance level.
3. Choose appropriate estimator, test statistic, and sampling distribution.
4. Construct the test statistic and compute the p-value (or compare it to the critical
value)
5. State conclusions in statistical and practical terms

General t-testing approach

t = \frac{\text{estimator} - \text{value under } H_0}{\text{standard error of estimator}} \approx t_{df}

For testing a sample mean:

t = \frac{\bar{Y} - \mu_0}{S_Y / \sqrt{n}}

For testing a regression slope:

t = \frac{\hat{\beta}_1 - b}{\mathrm{SE}(\hat{\beta}_1)}

Testing the magnitude of β₁

• Large sample test:

\frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \approx t_{n-2}

a) Construct the test statistic:

t = \frac{\hat{\beta}_1 - b}{\mathrm{SE}(\hat{\beta}_1)}

b) Reject at the α significance level if |t| > t_{α/2, n-2}.
c) The p-value is (the probability in the tails of the distribution outside |t|):

2 \times P(t_{n-2} > |t|)

d) Reject at the α significance level if the p-value is lower than α.
e) This is the test that always appears in the Python output.
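
The sketch below shows where these quantities appear in a statsmodels fit: the reported t-statistics and p-values correspond to the two-sided test of H₀: β_j = 0 described above, and conf_int() gives the confidence intervals discussed later. The data are simulated, and the value b used in the last test is an arbitrary illustration.

```python
# Minimal sketch: slope t-test and p-value from a statsmodels OLS fit (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 200)
y = 0.3 + 0.8 * x + rng.normal(0, 1, 200)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

print(results.params)      # beta0_hat, beta1_hat
print(results.tvalues)     # t statistics for H0: beta_j = 0
print(results.pvalues)     # two-sided p-values
print(results.conf_int())  # 95% confidence intervals by default

# Testing H0: beta1 = b for some other value b, e.g. b = 1:
b = 1.0
t_stat = (results.params[1] - b) / results.bse[1]
print(t_stat)
```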


Summary











Interpretation

• Correct interpretation of a p-value: it is the probability, under the null hypothesis, of
getting a result at least as extreme as what we observed.
• A significant parameter means that we can reliably detect that it is not exactly zero.
• A non-significant parameter can mean that either:
a) The parameter is indeed exactly zero (which likely makes no difference to our
lives)
b) We cannot measure the effect sufficiently accurately (the standard error is
large) to reliably say that it is not exactly zero.
• NB: if the parameter is not significant, provide and discuss the confidence interval.


Confidence intervals

Confidence intervals always have the form:

point estimate ± critical value × standard error

Using:

\frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \approx t_{n-2}

we have that an approximate 100 × (1 − α)% confidence interval for β₁ in large samples is:

\hat{\beta}_1 \pm t_{n-2, \alpha/2} \times \mathrm{SE}(\hat{\beta}_1)

Interpretation
• A 100 × (1 − α)% confidence interval is:
    o The range of values that we cannot reject at the α significance level.
    o An interval estimator that has probability 1 − α of containing the true
      parameter in repeated experiments of obtaining a random sample of size n
      from the population and computing a confidence interval.
• For a given calculated CI, we can say that one of the following must be true:
    i. The population parameter is in the confidence interval.
    ii. We got unlucky (an event with probability α has happened).

Concluding remarks

Concise way to report regressions:

\hat{Y} = \underset{\mathrm{SE}(\hat{\beta}_0)}{\hat{\beta}_0} + \underset{\mathrm{SE}(\hat{\beta}_1)}{\hat{\beta}_1} \times X, \qquad R^2 = \cdots, \quad \mathrm{SER} = \cdots

Summary
• OLS is unbiased and consistent under the standard assumptions.
• We can derive formulas for the standard errors.
• These SE formulas are unbiased and consistent under the assumptions.
• Large sample CIs and test statistics.


Predicting with a SLR



SLR model predictions

SLR Model: Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

What is a prediction?
• Prediction means generalizing our estimated model to the population.
• We start with an observed sample (y₁, x₁), …, (y_n, x_n) and compute the least
squares estimates β̂₀ and β̂₁ for that sample. That leads to an estimated regression
line β̂₀ + β̂₁x.
• We then observe a new case from the population where X = x₀, without knowing
Y₀. Our prediction for Y₀ is β̂₀ + β̂₁x₀.

Conditional expectation estimator

• The SLR prediction is an estimator of the conditional expectation:

m(x) \equiv E(Y \mid X = x) = \beta_0 + \beta_1 x

• Our estimator of the conditional expectation is:

\hat{m}(x) = \hat{\beta}_0 + \hat{\beta}_1 x

Prediction error
We define the prediction error as:

\eta_0 = Y_0 - \hat{m}(x_0)

Under Assumption 1,

\eta_0 = \left[ m(x_0) - \hat{m}(x_0) \right] + \varepsilon_0

Interpretation:
The prediction error has two components:
1. The estimation error for the regression line.
2. The unavoidable noise ε₀.


Theory and practice
• We need to make several assumptions about the model; however, these assumptions
are often unrealistic.
• Regression models can be very useful approximations to the truth.
• Predictive modelling has a clear objective, which is to generate the best possible
predictions. We therefore propose models, measure and compare their performance
for prediction, and try to improve.
• Assumptions allow us to say more about a problem. In general, it is important to
understand in which situations violations cause serious problems.

In the SLR model we can drop Assumptions 1 and 2 (which essentially mean that the model
is correct) and still say the following:
• Result: when n → ∞, the least squares method will lead to the optimal linear
prediction of Y given X = x (in terms of mean squared error).



Sampling properties of the conditional expectation estimator

Expected value and unbiasedness
From here on we are back to Assumptions 1–5.

Since β̂₀ and β̂₁ are unbiased, the estimator of the conditional expectation is unbiased:

E[\hat{m}(x)] = E(\hat{\beta}_0) + E(\hat{\beta}_1) x = \beta_0 + \beta_1 x

Variance

\mathrm{Var}[\hat{m}(x)] = \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar{X})^2}{(n-1) s_X^2} \right]

Interpretation:
• It grows with σ². The more noise there is in the data, the harder it is to estimate the
regression line.
• It is proportional to 1/n: the larger the sample size, the more accurate the regression
line.
• It has two components:
    1. σ²/n corresponds to the sampling uncertainty in estimating the mean of Y.
    2. The second component decreases with s_X², since a larger s_X² makes it easier to estimate the slope.
• It increases with (x − X̄)². The further we are from the center of the data, the larger
the sampling uncertainty.

Large sample approximation for the sampling distribution

For n sufficiently large, we can apply the CLT to approximate the distribution of m̂(x) as:

\hat{m}(x) \approx N\!\left( \beta_0 + \beta_1 x,\; \sigma^2 \left[ \frac{1}{n} + \frac{(x - \bar{X})^2}{(n-1) s_X^2} \right] \right)

Confidence interval for the conditional expectation

The approximate 100 × (1 − α)% confidence interval for the regression line is therefore:

\hat{m}(x) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}[\hat{m}(x)]}

Sampling properties of the SLR prediction error

Prediction error:

\eta_0 = Y_0 - \hat{m}(x_0) = \left[ m(x_0) - \hat{m}(x_0) \right] + \varepsilon_0

The expected value and variance of the prediction error are:

E(\eta_0) = E\left[ m(x_0) - \hat{m}(x_0) \right] + E(\varepsilon_0) = 0

\mathrm{Var}(\eta_0) = \mathrm{Var}[\hat{m}(x_0)] + \mathrm{Var}(\varepsilon_0) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{(n-1) s_X^2} \right]

Interpretation:
• The first component is σ², associated with ε₀. We call this the irreducible error since
it does not decrease with the size of the estimation sample.
• The second is the variance of m̂(x₀), which was derived above.

Prediction interval
A 100 × (1 − α)% prediction interval (Y₀,L, Y₀,U) has the property that:

P\left( Y_{0,L} < Y_0 < Y_{0,U} \mid X_0 = x_0 \right) = 1 - \alpha

• The probability is over the random experiment of drawing a random sample of size n
from the population (for given values of the regressor), estimating β₀ and β₁,
computing Y₀,L and Y₀,U based on β̂₀, β̂₁ and x₀, and drawing Y₀ | X₀ = x₀.
    o NB: this is a hard concept to grasp, as it is a "mix" of computing a confidence
      interval for m(x₀) and computing a probability over the distribution of ε₀.

\eta_0 = \left[ m(x_0) - \hat{m}(x_0) \right] + \varepsilon_0

• Since the prediction error depends on ε₀, we can only build a prediction interval if
we explicitly estimate the distribution of the errors.
    o One option is to use a computational method called bootstrapping (beyond
      our scope).
    o Another option is to assume that ε ~ N(0, σ²), in which case we can conduct
      exact inference.
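
In practice, statsmodels can produce both the confidence interval for the regression line and the prediction interval for a new observation (the latter relying on the Gaussian error assumption discussed in the next module). A hedged sketch with simulated data and an arbitrary new value x₀ = 4:

```python
# Minimal sketch: confidence interval for m(x0) and prediction interval for Y0 (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 100)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

x0 = np.array([[1.0, 4.0]])            # new case: constant term and x0 = 4
pred = results.get_prediction(x0)
frame = pred.summary_frame(alpha=0.05)

# mean_ci_lower / mean_ci_upper: CI for the conditional expectation m(x0)
# obs_ci_lower / obs_ci_upper: prediction interval for the new response Y0
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```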

Technical details


The Gaussian SLR Model

• So far the analysis has not made any distributional assumptions about the regression
errors: except for conditions such as E(ε | X = x) = 0, we left the probability
distribution of ε unspecified.
    o This flexibility is one reason why OLS is a popular method.
    o There are, however, limits to OLS:
        § We do not know what the exact sampling distribution of the OLS
          estimator is in finite samples.
        § We cannot build prediction intervals.
• This module considers what we can learn by assuming:

\varepsilon \sim N(0, \sigma^2)


Bias-variance trade-off: there is always a trade-off in statistical modeling. The more we
assume, the more we can say about a problem. Stronger assumptions lead to higher
statistical power (lower probability of Type II errors). However, if these assumptions are
unrealistic, then analysis is biased.

The Gaussian SLR Model

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)

where we maintain Assumptions 1–5 from Module 2.

A key feature of the Gaussian SLR model is that we know the full form of the conditional
distribution of Y:

Y_i \mid X_i = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)


Maximum likelihood estimation

Maximum likelihood (ML) estimation is available when we are able and want to specify a
full probabilistic model for the population.
• This is a new concept which we now introduce for the specific case of the Gaussian
SLR model.
• Intuitively, ML estimation chooses the values of the parameters that maximize the
probability of the observed data under the model.

Y_i \mid X_i = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)

The probability density function (PDF) evaluated at an observed value y_i is:

p(y_i \mid x_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right)

This is the formula for the normal PDF from basic statistics. The difference is that in basic
statistics it was formulated in terms of a population mean μ, while the SLR model specifies a
particular mean, which is β₀ + β₁x_i.



in

Review
Recall that two events A and B are independent if and only if their joint probability is the
product of their individual probabilities:

P(A \text{ and } B) = P(A) \cdot P(B)

When we talk about two independent continuous random variables U and V, their joint PDF
evaluated at U = u and V = v is the product of their individual PDFs:

p(u, v) = p(u)\,p(v)

Likelihood
The likelihood function is the joint PDF of the response evaluated at the observed sample
values. In our Gaussian SLR model, Assumption 4 (independence) implies that we can
multiply the PDFs for each observation:

\prod_{i=1}^{n} p(y_i \mid x_i; \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right)


Log-likelihood
In practice, it is more convenient to work with the natural logarithm of the likelihood, the
log-likelihood:

L(\beta_0, \beta_1, \sigma^2) = \log \prod_{i=1}^{n} p(y_i \mid x_i; \beta_0, \beta_1, \sigma^2) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \beta_0, \beta_1, \sigma^2)
= -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

The likelihood and the log-likelihood have the same maximum points.

Maximum likelihood estimation
We then maximize the log-likelihood as a function of the parameters:

\max_{\beta_0, \beta_1, \sigma^2} L(\beta_0, \beta_1, \sigma^2)

L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

• The last term corresponds to the residual sum of squares (RSS) times a negative
multiplier.
    o Therefore, it turns out that ML estimation of β₀ and β₁ is the same as OLS for
      this model.
• We maximize for σ² separately by substituting the OLS residual sum of squares into the
formula (since we already obtained β̂₀ and β̂₁ without needing to worry about σ²). This
leads to:

\hat{\sigma}^2 = \frac{1}{n} \sum_i e_i^2
Properties
Beyond our specific example ML estimation has the following properties under general
conditions:
• Consistency
• Unbiasedness when 𝑛 → ∞ (asymptotic unbiasedness)
• Efficiency when 𝑛 → ∞. No other estimator that is unbiased when 𝑛 → ∞ has
smaller variance when 𝑛 → ∞.
• The sampling distribution of the ML estimator becomes Gaussian when 𝑛 → ∞.
• Accurate finite sample distribution approximations (if you trust the model, and with
enough computational power).

• Limitation: ML can give bad or pathological estimates in certain cases.

Discussion
We ended up back at OLS, what is the point?
• Added a new algorithm to the toolbox and started by understanding it in a simple
case.
• Learned that if the errors are Gaussian or approximately Gaussian, we can be
confident that ML/OLS will work very well for SLR.
• ML is broadly applicable. It will be necessary for variations and generalizations of our
regression framework. OLS will no longer be an option or make sense in those cases.
• We need the concept of a log-likelihood for model selection
• ML and OLS give different estimates when the distribution of the errors is not
Gaussian.

Statistical inference

Fundamental property of Gaussian distributions
Let X and Y be two (jointly) Gaussian random variables and a, b and c be non-random scalars. Now
define the random variable:

W = aX + bY + c

It follows that:

W \sim N\!\left( E(W), \mathrm{Var}(W) \right)

where we previously reviewed the formulas for E(W) and Var(W).

Interpretation:
• A linear combination of Gaussian random variables is itself a Gaussian random
variable.

In Module 2 it was established that:

\hat{\beta}_1 = \beta_1 + \sum_i c_i \varepsilon_i = \beta_1 + \sum_i \frac{X_i - \bar{X}}{\sum_j (X_j - \bar{X})^2} \varepsilon_i

Since ε_i is Gaussian for all i, it follows that β̂₁ is a linear combination of Gaussian random
variables.

Therefore, for the Gaussian model:

\hat{\beta}_1 \sim N\!\left( \beta_1, \frac{\sigma^2}{(n-1) s_X^2} \right)

where the mean and variance follow from earlier results, which hold regardless of the
distribution.

t-statistic

When estimating σ² unbiasedly, we have:

\frac{\hat{\beta}_1 - \beta_1}{\mathrm{SE}(\hat{\beta}_1)} \sim t_{n-2}

where t_{n-2} denotes the t distribution with n – 2 degrees of freedom. This is an exact
sampling distribution.

Sampling distribution of the error variance estimator
In Gaussian SLR, we can show that:

\frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}

where \chi^2_{n-2} denotes the chi-square distribution with n – 2 degrees of freedom.

Review
• Let Z₁, Z₂, …, Z_n be n independent standard normal random variables. Then \sum_{i=1}^{n} Z_i^2
follows the \chi^2_n distribution.


Discussion
• In the Gaussian SLR model, we have exact sampling distributions and test statistics
for the OLS estimators, including σ̂².
• These results parallel the ones from basic statistics for the sample mean.
• This pattern extends to multiple linear regression.



Prediction interval
In the last module we investigated the case where we predict a new response Y₀ for a given
value x₀, based on a regression line estimated on a sample that is independent of Y₀.

We defined the prediction error as:

\eta_0 = Y_0 - \hat{m}(x_0) = \left[ m(x_0) - \hat{m}(x_0) \right] + \varepsilon_0

We proved that:

E(\eta_0) = 0

and

\mathrm{Var}(\eta_0) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{(n-1) s_X^2} \right]

In the Gaussian SLR model, both \hat{m}(x_0) and ε₀ are Gaussian, so it follows from the
same arguments as before that:

\eta_0 \sim N\!\left( 0, \mathrm{Var}(\eta_0) \right)

A 100 × (1 − α)% prediction interval for Y₀ is:

\hat{m}(x_0) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\eta_0)}

where z_{α/2} is the approximate critical value. In practice we replace Var(η₀) with an
estimate based on plugging the SER into the formula.

What does a linear regression estimate?



Interpreting a linear regression model

SLR model:
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

Regression analysis has multiple objectives:
1. Understanding the relationship between X and Y.
2. Prediction: estimating the regression function E(Y | X = x).
    a. "If we observe X = x, then we would expect Y to be E(Y | X = x) on average."
3. Causal analysis: estimating E(Y | do(X = x)).
    a. This is an explicit intervention: "if we do X = x, then we would expect Y to be
       E(Y | do(X = x)) on average."

p

Example
1. Prediction: "students who participate in the Maths in Business program score 15
marks higher on QBUS2810 (β₁ = 15) on average."
2. Causal: "if we persuade students to participate in the Maths in Business program,
then they will score 15 marks higher on QBUS2810 (β₁ = 15) on average."


Interpretation
It is a mistake to make the second type of statement, when general regression analysis only
supports the first.


Interpretation of the slope coefficient

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

• Correct: if we select two cases from the population where X differs by 1, then we
would expect Y to differ by β₁.
• Acceptable: a unit change in X is associated with a β₁ change in Y on average in the
population.
• Incorrect (unless causal inference is supported): if we change X by 1, then Y will
change by β₁ on average.

Causal Analysis

• Run a randomized experiment. This is known as A/B testing in business.
Unfortunately, this is often not possible.
o Or find a natural experiment.
• Common sense principle: your study design must reflect what you want to measure.
If you want to estimate E(Y | do(X = x)), then your data should have cases of "do X =
something" (with attention to the requirements for causal inference).

Examples
• Randomized experiment: randomly divide a cohort into two groups. Make half take
Maths in Business, the other half does not take it. Compare the results on
QBUS2810.
o You can see why randomized experiments are often not feasible.
• Natural experiment: suppose that the Business School makes Maths in Business
mandatory. We then compare the cohorts just before and after the external change,

perhaps “controlling” for the observed characteristics of the two cohorts (such as
their performance on first year units).

Omitted variable bias

• It is essential to understand where the OLS method leads to biased estimates.
    o While this section focuses on SLR (for simplicity), the same arguments apply
      to MLR.

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

• The error, ε, represents factors that influence the response Y but are not included in
the model. Hence there are always potential omitted variables.
• Omitted variable bias (OVB) in the OLS estimator occurs when the error term and X
are correlated:

E(\varepsilon \mid X) \neq 0

NB: this violates Assumption 2.

• Therefore, two conditions must hold for a particular omitted factor, Z, to induce bias:
    1) It needs to be related to Y (part of ε).
    2) It needs to be correlated with the existing regressors.

Theoretical Analysis

What is the size and direction of the bias?

\hat{\beta}_1 - \beta_1 = \frac{\sum_i (X_i - \bar{X}) \varepsilon_i}{\sum_i (X_i - \bar{X})^2}

What if Cov(X, ε) = σ_{Xε} ≠ 0?

In general:

\hat{\beta}_1 - \beta_1 = \frac{\sum_i (X_i - \bar{X}) \varepsilon_i}{\sum_i (X_i - \bar{X})^2} \xrightarrow{p} \frac{\mathrm{Cov}(X, \varepsilon)}{\mathrm{Var}(X)} = \rho_{X\varepsilon} \frac{\sigma_\varepsilon}{\sigma_X}

Omitted variable bias

\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{X\varepsilon} \frac{\sigma_\varepsilon}{\sigma_X}

Interpretation:
• The OLS estimator is positively biased (it overestimates β₁) when the regressor is positively
correlated with the error.
• The OLS estimator is negatively biased (it underestimates β₁) when the regressor is negatively
correlated with the error.
• The bias does not get smaller with the sample size: β̂₁ is inconsistent when ρ_{Xε} ≠ 0.
• The bias is larger when the correlation between the error and the regressor is larger
in absolute value.
• The bias is larger when the variance of the error is larger relative to the variance of
X.
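
The direction of the bias can be checked with a simple simulation: generate data where an omitted variable Z both affects Y and is positively correlated with X, and compare the SLR slope estimate with the true β₁. All parameter values below are assumptions chosen for illustration.

```python
# Minimal sketch: simulating omitted variable bias (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
z = rng.normal(0, 1, n)                              # omitted variable
x = 0.8 * z + rng.normal(0, 1, n)                    # X is positively correlated with Z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(0, 1, n)    # true beta1 = 2; Z ends up in the error term

# SLR of Y on X only: the error contains 3*z, which is correlated with X.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(b1_hat)   # noticeably larger than 2: upward bias that does not vanish as n grows
```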


Example
• Marital status, age, home ownership, and purchase history stand out as being
positively related to both amount spent and salary.
• Hence, these regressors lead to omitted variable bias in the SLR.
• Due to the positive correlation, β̂₁ is upward biased in the SLR model.


Omitted variable bias and causality
The presence of omitted variables rules out causal interpretation of estimated linear
regression coefficients.

Recall, the three conditions needed to establish causality from X to Y:
1) X and Y must co-vary or have a relationship. When one changes the other must also
change.
2) The independent variable X must precede the dependent variable Y in time.
3) No other factor could have possibly have caused the measured change in Y.

NB: experimental studies eliminate OVB by assigning X independently of ε.


Omitted variable bias and prediction
• We are not necessarily interested in estimating the population ß1.
• In many applications, we simply want to predict Y given X. Omitted variable bias is a
problem in this case only to the extent it limits the accuracy of the predictive model.

Measurement error in the regressor

• Measurement error in the regressor is an issue for several areas of application of
regression analysis, such as finance and economics.
• This is a special case of OVB, where the omitted variable is the actual regressor.
    o Example: CAPM. The CAPM says that the cross section of expected returns
      should be a linear function of the asset betas.
    o But we need to estimate beta first. This leads to a measurement error
      problem when we investigate the relationship between betas and average
      returns.


CAPM example

Data is generated according to:

Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

Unfortunately, instead of x_i, we observe:

\tilde{X}_i = x_i + u_i

where u_i is an i.i.d. measurement error which is independent of ε_i and x_i and satisfies
E(u_i) = 0.

Hence the model we are trying to estimate is:

Y_i = \beta_0 + \beta_1 \tilde{X}_i + (\varepsilon_i - \beta_1 u_i)

or

Y_i = \beta_0 + \beta_1 \tilde{X}_i + \delta_i

where \delta_i = \varepsilon_i - \beta_1 u_i.

Because \tilde{X}_i depends on u_i, which is part of δ_i,

E(\tilde{X}_i \delta_i) \neq 0 \;\Rightarrow\; E(\delta_i \mid \tilde{X}_i) \neq 0,

thus violating the SLR model exogeneity assumption.


General case
Using similar arguments as for the general OVB case,

\hat{\beta}_1 \xrightarrow{p} \frac{\mathrm{Cov}(\tilde{X}, Y)}{\mathrm{Var}(\tilde{X})} = \beta_1 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_u^2}

Interpretation:
• When there is measurement error in the regressor, the slope estimator is pulled
towards zero (the regression line is biased towards being flatter), since
\frac{\sigma_X^2}{\sigma_X^2 + \sigma_u^2} < 1. We call this an attenuation bias.
• The bias is more severe when the variance of the measurement error is larger
relative to the variance of the regressor.
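
The attenuation factor can likewise be illustrated numerically: regress Y on a noisy version of the regressor and compare the slope with β₁ σ_X²/(σ_X² + σ_u²). The parameter values are illustrative assumptions.

```python
# Minimal sketch: attenuation bias from measurement error in the regressor (illustrative values).
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
sigma_x, sigma_u = 2.0, 1.0

x_true = rng.normal(0, sigma_x, n)
y = 1.0 + 2.0 * x_true + rng.normal(0, 1, n)     # true beta1 = 2
x_obs = x_true + rng.normal(0, sigma_u, n)       # observed regressor with measurement error

b1_hat = np.sum((x_obs - x_obs.mean()) * (y - y.mean())) / np.sum((x_obs - x_obs.mean()) ** 2)
attenuation = sigma_x**2 / (sigma_x**2 + sigma_u**2)
print(b1_hat, 2.0 * attenuation)   # the estimate is pulled towards zero by the attenuation factor
```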


Multiple Linear Regression


Multiple Linear Regression

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon

An MLR model describes the relationship between a numerical, continuous response variable Y and
multiple predictors X₁, …, X_p.

Statistical objective: to estimate the conditional expectation E(Y | X₁ = x₁, …, X_p = x_p) for
predictive purposes.
• In some cases we may be interested in a causal prediction. However, this requires a
study design specific for this purpose.


Terminology and notation

Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i

• Y is the response or dependent variable.
• X₁, …, X_p are the predictors, independent variables, regressors, or features.
• β₀ is the intercept, a fixed and unknown population parameter.
• β₁, …, β_p are the regression coefficients, fixed and unknown population parameters.
• ε are the random errors.
• The subscript i indexes the observations, i = 1, …, n.
• Y_i, X_{1i}, X_{2i}, …, X_{pi} denote random variables.
• y_i, x_{1i}, x_{2i}, …, x_{pi} denote observed values for the i-th unit.


Parameter interpretation
Consider a case with two regressors:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon

• β₁ is the expected difference in Y when we select two cases from the population
where X₁ differs by one unit and X₂ is the same.
• β₂ is the expected difference in Y when we select two cases from the population
where X₂ differs by one unit and X₁ is the same.

X₁ differs by Δx₁, X₂ is the same:

E(Y \mid X_1 = x_1 + \Delta x_1, X_2 = x_2) - E(Y \mid X_1 = x_1, X_2 = x_2)
= \left[ \beta_0 + \beta_1 (x_1 + \Delta x_1) + \beta_2 x_2 \right] - \left[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 \right] = \beta_1 \Delta x_1

General rule:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon

• β₁ is the expected difference in Y when we select two cases from the population
where X₁ differs by one unit and X₂, …, X_p are the same.
• β_j is the expected difference in Y when we select two cases from the population
where X_j differs by one unit and X_k is the same for all k ≠ j.



Least squares estimation

Define the residual sum of squares (RSS) for arbitrary coefficient values b₀, b₁, …, b_p as:

\mathrm{RSS}(b) = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_{1i} - \cdots - b_p X_{pi} \right)^2

We define the ordinary least squares (OLS) estimator (β̂₀, β̂₁, …, β̂_p) as the estimator that
minimizes the RSS:

(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg\min_{b_0, b_1, \ldots, b_p} \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_{1i} - \cdots - b_p X_{pi} \right)^2

Intuition

If X1 and X2 are correlated, then X1 and X2 predict each other. Hence, the predictive
information from X1 and X2 combined for Y does not amount to adding up the information
from X1 and X2 separately, as this would entail some “double counting”.

Notation and terminology

Fitted values:

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_p X_{pi}

Residuals:

e_i = Y_i - \hat{Y}_i


Measuring fit


Standard error of the regression
The SER estimates the standard deviation of the model errors ε:

\mathrm{SER} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n} e_i^2}

NB: dividing by n − p − 1 leads to an unbiased estimator of the error variance (because we
estimate p + 1 parameters).


Analysis of variance decomposition

\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2

TSS = RegSS + RSS

• TSS: total sum of squares
• RegSS: regression sum of squares
• RSS: residual sum of squares



R²

R^2 = \frac{\mathrm{RegSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2}

Interpretation:
• R² measures the proportion of the variance in the response data that is accounted
for by the estimated linear regression model.
• R² can only increase when you add another variable to the model.
• R² is a useful part of the regression toolbox, but it does not measure the predictive
accuracy of the estimated regression, or more generally how good the model is.

Adjusted R²
The adjusted R² penalizes the R² for the number of regressors:

\bar{R}^2 = 1 - \frac{n - 1}{n - p - 1} \cdot \frac{\mathrm{RSS}}{\mathrm{TSS}}

• The adjusted R² has the same interpretation as R².
• Unlike R², the adjusted R² may increase or decrease upon the addition of another regressor.
• The adjusted R² is smaller than R². If n is large, the two are approximately equal.
• The adjusted R² can be negative.
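
A hedged sketch of fitting an MLR and reading off R² and the adjusted R² with statsmodels; the variable names and data are invented for illustration, with "noise" acting as an irrelevant extra regressor.

```python
# Minimal sketch: fitting an MLR and comparing R^2 with adjusted R^2 (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(0, 1, n),
    "x2": rng.normal(0, 1, n),
})
df["y"] = 1.0 + 0.5 * df["x1"] + 0.2 * df["x2"] + rng.normal(0, 1, n)
df["noise"] = rng.normal(0, 1, n)   # an irrelevant predictor

fit_small = smf.ols("y ~ x1 + x2", data=df).fit()
fit_big = smf.ols("y ~ x1 + x2 + noise", data=df).fit()

# R^2 can only go up when a regressor is added; adjusted R^2 may go down.
print(fit_small.rsquared, fit_big.rsquared)
print(fit_small.rsquared_adj, fit_big.rsquared_adj)
```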

Statistical inferences for MLR I

The MLR model

Assumptions

1. Linearity: if X₁ = x₁, …, X_p = x_p then Y = β₀ + β₁x₁ + ⋯ + β_p x_p + ε for some
population parameters β₀, β₁, …, β_p and a random error ε.
2. Exogeneity: the conditional mean of ε given X₁, …, X_p is zero, that is
E(ε | X₁, …, X_p) = 0.
3. Constant error variance: Var(ε | X₁, …, X_p) = σ².
4. Independence: all the error pairs ε_i and ε_j (i ≠ j) are independent.
5. The distribution of X₁, …, X_p is arbitrary (the Xs can even be non-random).
6. There is no perfect multicollinearity.

Perfect multicollinearity
• Perfect multicollinearity occurs when one of the regressors is an exact linear
combination of the other regressors.
    o I.e. there are one or more predictors that do not add any information to the
      regression.

Example of multicollinearity:
• Suppose there are two variables, male and female. The male variable takes the value
one if the customer is male and zero if the customer is female; the female variable is
the reverse. These are called dummy variables.
    o Once we have one of these variables, the other does not add any information.
        § If male is one then female must be zero.
    o β₀ is the expected amount when all the regressors are zero. But the male and
      female variables cannot both be zero.
• Another example leading to perfect multicollinearity is when there are more predictors than
observations: p > n.
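
The dummy-variable example can be seen directly in code: pandas can create the two dummies, and dropping one of them (drop_first=True) avoids perfect multicollinearity with the intercept. The toy data below are an assumption.

```python
# Minimal sketch: the dummy-variable trap and how drop_first avoids it (toy data).
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male", "male"]})

both = pd.get_dummies(df["gender"])                   # columns 'female' and 'male': female + male = 1,
                                                      # an exact linear combination with the intercept
one = pd.get_dummies(df["gender"], drop_first=True)   # keeps only 'male'; 'female' becomes the baseline

print(both)
print(one)
```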


Terminology

• Perfect multicollinearity: one or more predictors are an exact linear combination of
the other predictors.
• Multicollinearity: one or more predictors are highly correlated with a linear
combination of other predictors.
• Collinearity: two predictors are highly correlated with each other.


Beyond the assumptions

The MLR model assumptions are not sufficient to guarantee reliable estimation and inference
in practice. Analysis should always consider:
• Outliers (unusual observations)
• Very skewed errors
• High correlation amongst predictors

Assumption checking

Residual diagnostics
Need to observe multiple plots here:
• Fitted values against residuals
• Predictors against residuals
• Fitted values against squared or absolute residuals
• Predictors against squared or absolute residuals
• Residual distribution
• If observations are ordered: residuals against coordinates (time and/or space)
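
A minimal matplotlib sketch of two of these diagnostics (fitted values against residuals, and the residual distribution), using a fitted statsmodels model. The data are simulated.

```python
# Minimal sketch: two residual diagnostic plots for a fitted MLR (simulated data).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 300
X = rng.normal(0, 1, (n, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(0, 1, n)

results = sm.OLS(y, sm.add_constant(X)).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(results.fittedvalues, results.resid, s=10)
axes[0].axhline(0, color="grey")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

axes[1].hist(results.resid, bins=30)
axes[1].set_xlabel("Residuals")
axes[1].set_title("Residual distribution")

plt.tight_layout()
plt.show()
```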

Sampling properties of the OLS estimator



Under Assumptions 1–6 we can show that the least squares estimator (β̂₀, β̂₁, …, β̂_p) is:

• A linear function of Y₁, …, Y_n. Therefore, it is a relatively simple estimator that is a
type of weighted average of the data.
• Unbiased: E(β̂_j) = β_j for all j.
• Consistent.
• Normally distributed when n → ∞ (CLT).
• The most efficient linear unbiased estimator of the population parameters
when n → ∞.
• If the errors follow a normal distribution, the OLS estimator is the maximum
likelihood estimator.



Variance

\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(1 - R_j^2) \sum_{i=1}^{n} (x_{ji} - \bar{x}_j)^2} = \frac{\sigma^2}{(1 - R_j^2)(n - 1) s_{x_j}^2}

R_j^2 : the R-squared of a regression of predictor j on all the other predictors
x_{ji} : observed value of predictor j for observation i
\bar{x}_j : sample average of predictor j
s_{x_j}^2 : sample variance of predictor j

Interpretation:
• The formula is similar to the one for SLR but with the additional factor 1/(1 − R_j²).
• The higher the correlation of predictor j with the other predictors, the higher the
variance of β̂_j.
• The variance of β̂_j is proportional to the variance of the errors σ².
• The variance decreases with the sample variance of predictor j.


Why does a higher correlation with the other predictors increase the variance of β̂_j, all else
equal?
• β_j Δx_j is the expected difference in Y when we select two cases from the population
where X_j differs by Δx_j and X_k is the same for all k ≠ j.
• Intuitively: if X_j is highly correlated with the other predictors, then there is not a lot of
sample variation for X_j in isolation. Hence, there is limited information to estimate β_j
as defined.
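
The factor 1/(1 − R_j²) is the variance inflation factor (VIF), which statsmodels can compute directly. A sketch with two strongly correlated simulated predictors:

```python
# Minimal sketch: variance inflation factors for correlated predictors (simulated data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)
n = 1000
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(0, 1, n)   # highly correlated with x1

X = sm.add_constant(np.column_stack([x1, x2]))
# VIF_j = 1 / (1 - R_j^2); column 0 is the constant.
vifs = [variance_inflation_factor(X, j) for j in range(X.shape[1])]
print(vifs)   # VIFs for x1 and x2 are well above 1 because of the high correlation
```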



Sampling distribution: Gaussian errors
In addition to Assumptions 1–6, suppose we assume that the errors are normally
distributed:

Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

Then:

\hat{\beta}_j \sim N\!\left( \beta_j, \mathrm{Var}(\hat{\beta}_j) \right)

Sampling distribution: general case
Assumptions 1–6 only give moment conditions for the errors:

E(\varepsilon \mid X_1, \ldots, X_p) = 0, \qquad \mathrm{Var}(\varepsilon \mid X_1, \ldots, X_p) = \sigma^2

Therefore we do not know the distribution of β̂_j in the general case, apart from its mean
and variance.

For n sufficiently large, we may choose to rely on the CLT approximation:

\hat{\beta}_j \approx N\!\left( \beta_j, \mathrm{Var}(\hat{\beta}_j) \right)

Appendix

Conditional inferences
• The model treats the predictors 𝑋J , … , 𝑋K as random. This assumption reflects the
nature of observational data.
• However, when discussing statistical inference we treat the predictors as given.
o For example, we provide the variance of the least squares estimator as a
function of the observed predictor values.
o This is because we naturally work with observed predictors
• The sampling uncertainty of interest to us is therefore over the errors ε₁, ε₂, …, ε_n.

Statistical inferences for MLR II

Standardised coefficients

Standard errors

The standard error for β̂_j is:

\mathrm{SE}(\hat{\beta}_j) = \frac{\mathrm{SER}}{\sqrt{(1 - R_j^2) \sum_i (x_{ji} - \bar{x}_j)^2}}

where the standard error of the regression estimates σ:

\mathrm{SER} = \sqrt{\frac{1}{n - p - 1} \sum_i e_i^2}

Robust standard errors
• Use robust standard errors when the assumption of constant variance is not satisfied
for the data.
o NB the technical details for robust standard errors are not discussed for MLR


Sampling distribution in the Gaussian MLR model
If the errors are Gaussian, we can show that:

\frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p-1}

Inference for one coefficient

Sampling distribution: general case

If the distribution of the errors is unspecified but the sample size is sufficiently large, we use the
CLT approximation:

\frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} \approx N(0, 1)

or

\frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} \approx t_{n-p-1}

Hypothesis testing

Relationship test:

Two sided:
H₀: β_j = 0
H₁: β_j ≠ 0

Interpretation:
• If the null hypothesis is correct, there is no relationship between predictor j and the
response conditional on the other predictors.
• Alternatively, we can say that variable j does not predict the response, after taking the
other predictors into account.

One sided:
H₀: β_j = 0
H₁: β_j > 0

General test:
Two sided:
H₀: β_j = b
H₁: β_j ≠ b
One sided:
H₀: β_j = b
H₁: β_j > b

Test statistic

t_{\hat{\beta}_j} = \frac{\hat{\beta}_j - b_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p-1}

• We can carry out hypothesis testing on β_j using the standard t-statistic with
n − p − 1 degrees of freedom.
• The test is a large sample approximation when the errors are not Gaussian.

Summary of Hypothesis testing


Example of Hypothesis Testing:

Test: H₀: β₁ = 0 vs H₁: β₁ ≠ 0

Significance level: α = 0.05

Assumptions: MLR model assumptions with non-constant error variance; n = 1000 is
sufficiently large for a CLT approximation (no outliers).

Estimator: β̂_j (OLS). Robust standard error.

Test statistic: t_{\hat{\beta}_j} = \frac{\hat{\beta}_j - b_j}{\mathrm{SE}(\hat{\beta}_j)} \approx t_{n-p-1}

Calculated statistic: t = 23.5 (from the output)

Decision: the p-value of < 0.0001 (from the output) is lower than α = 0.05. Alternatively,
23.5 > t_{997, 0.025} ≈ 1.96.

Conclusion: We reject the null hypothesis.

t
en


Why the p-value does not measure coefficient importance m
cu

𝛽Â
Do

𝑡-Ä =
SER
1 − 𝑅€Z 𝑛 − 1 𝑠¡ZÃ
p
wa

Interpretation (all else equal):
• A higher coefficient magnitude leads to a larger test statistic.
• A larger sample size increases the test statistic.
• A lower error variance increases the test statistic.
• A lower correlation with the other predictors increases the test statistic.
• A larger sample variance of the predictor increases the test statistic.

NB: The statistical significance of a coefficient is influenced by several factors (σ², n, s²_{x_j} and R_j²) that have nothing to do with the real-world importance of β_j.
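
A small numeric illustration of this point, plugging assumed (fixed) values into the test statistic formula above and letting only n grow:

import numpy as np

beta_hat, ser, r2_j, s2_x = 0.05, 1.0, 0.2, 4.0    # hypothetical values, held fixed
for n in (50, 500, 5000):
    t = beta_hat / (ser / np.sqrt((1 - r2_j) * (n - 1) * s2_x))
    print(n, round(t, 2))
# The coefficient (and its practical importance) never changes, yet the test
# statistic keeps growing with the sample size.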


Confidence interval

point estimate ± critical value × standard error
Using

(β̂_j − β_j) / SE(β̂_j) ≈ t_{n−p−1}

the approximate 100 × (1 − α)% confidence interval for β_j in large samples is:

β̂_j ± t_{n−p−1, α/2} × SE(β̂_j)
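
A minimal sketch of this interval, using hypothetical values for the estimate, its standard error and the degrees of freedom:

from scipy import stats

beta_hat, se, n, p, alpha = 0.8, 0.3, 120, 3, 0.05   # assumed values
t_crit = stats.t.ppf(1 - alpha / 2, n - p - 1)
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(ci)   # statsmodels reports the same interval via results.conf_int(alpha)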


Inference for multiple coefficients

Consider the linear model:

Y = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε

Test:
H_0: β_1 = β_2 = ⋯ = β_p = 0
H_1: at least one β_j ≠ 0

F Statistic

F_stat = (RegSS / p) / (RSS / (n − p − 1))

Interpretation:
• TSS = RegSS + RSS decomposes the variation in the data between the variation accounted for by the estimated regression (RegSS) and the unaccounted variation (RSS).
• If the null hypothesis is correct, we would expect a relatively low RegSS and a relatively high RSS, leading to a small F statistic.


Understanding the F test: Chi-squared distribution

Z ~ N(0, 1)  ⟹  V = Z² ~ χ²_1

Z_i ~ N(0, 1) i.i.d.  ⟹  V = Σ_{i=1}^{v} Z_i² ~ χ²_v

Understanding the F test: F distribution

The F distribution is defined as the distribution of the ratio of two independent chi-squared random variables, each scaled by its degrees of freedom, i.e.

U ~ χ²_u, V ~ χ²_v independent  ⟹  (U/u) / (V/v) ~ F_{u,v}

We call this an F distribution with numerator degrees of freedom u and denominator degrees of freedom v.
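
A quick simulation (arbitrary degrees of freedom) checking this definition against scipy's F distribution:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
u, v, n_reps = 3, 30, 100_000
U = rng.chisquare(u, n_reps)
V = rng.chisquare(v, n_reps)
ratio = (U / u) / (V / v)                      # scaled ratio of independent chi-squareds

# Its simulated upper quantile should match the F_{u,v} quantile.
print(np.quantile(ratio, 0.95), stats.f.ppf(0.95, u, v))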





Understanding the F test: general case

Under the null hypothesis, for a Gaussian MLR model, RegSS/σ² and RSS/σ² follow χ² distributions with p and (n − p − 1) degrees of freedom respectively, and the two are independent.

From the definition of the F distribution:

F_stat = (RegSS / p) / (RSS / (n − p − 1)) ~ (χ²_p / p) / (χ²_{n−p−1} / (n − p − 1)) ≡ F_{p, n−p−1}

In the general case, we use the F distribution as a large sample approximation.
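
A sketch (simulated data, assumed coefficients) that builds the F statistic from RegSS and RSS and compares it with the value and p-value statsmodels reports:

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n, p = 400, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
reg_ss, rss = fit.ess, fit.ssr                      # RegSS and RSS
f_by_hand = (reg_ss / p) / (rss / (n - p - 1))
p_value = stats.f.sf(f_by_hand, p, n - p - 1)       # P(F_{p, n-p-1} > f_stat)

print(f_by_hand, fit.fvalue)                        # should agree
print(p_value, fit.f_pvalue)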

Key concepts of F test
• The F test is a one-sided test. We reject the null hypothesis for high values of the F statistic.
• We denote the critical value as F_{p, n−p−1, α} and define it as the value such that P(F_{p, n−p−1} > F_{p, n−p−1, α}) = α.
• The p-value is P(F_{p, n−p−1} > f_stat), where f_stat is the test statistic calculated from the sample.
• A heteroscedasticity-robust version of the F test is available, but its formulation departs substantially from the one we have for the constant variance case.
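
The critical value and p-value definitions in the list above translate directly into scipy calls (p, n, α and the test statistic here are hypothetical):

from scipy import stats

p, n, alpha, f_stat = 3, 200, 0.05, 4.7          # assumed values
f_crit = stats.f.ppf(1 - alpha, p, n - p - 1)    # P(F_{p, n-p-1} > f_crit) = alpha
p_value = stats.f.sf(f_stat, p, n - p - 1)       # P(F_{p, n-p-1} > f_stat)
print(f_crit, p_value)                           # reject H0 if f_stat > f_crit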

ANOVA table
Sometimes the F test is reported as an ANOVA table.

F statistic: R² formulation
Recall that

R² = RegSS / TSS = 1 − RSS / TSS

We can then rewrite the F statistic as:

F_stat = (RegSS / p) / (RSS / (n − p − 1))
       = (RegSS / TSS) / (RSS / TSS) × (n − p − 1) / p
       = R² / (1 − R²) × (n − p − 1) / p

Example

Test: H_0: β_1 = β_2 = 0 vs H_1: β_1 ≠ 0 and/or β_2 ≠ 0

Significance level: α = 0.05

Assumptions: MLR model assumptions with non-constant error variance; n = 1000 is sufficiently large for a CLT approximation (no outliers)

Estimator: OLS

Test statistic: F_stat = (RegSS / p) / (RSS / (n − p − 1)) ~ F_{p, n−p−1}

Calculated statistic: f_stat = 786.5 (from the output)

Decision: the p-value of < 0.0001 (from the output) is lower than α = 0.05. Alternatively, 786.5 > F_{2, 997, 0.05} = 3.00.

Conclusion: We reject the null hypothesis at the 5% significance level.
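
As a rough cross-check, assuming the output refers to the same regression reported earlier (R² = 0.612, rounded), plugging its R², n = 1000 and p = 2 into the R² formulation above reproduces the F statistic up to rounding:

r2, n, p = 0.612, 1000, 2                      # values from the output (R^2 rounded)
f_stat = (r2 / (1 - r2)) * ((n - p - 1) / p)
print(f_stat)                                  # about 786, close to the reported 786.5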


F test for q linear restrictions

Appendix




Data transformation



Introduction and Example

Data transformation consists of applying a deterministic mathematical function to each
observation of a response or predictor.

Data transformation is typically used for the following purposes:
• Modelling nonlinearity
• Meeting the assumption of constant error variance
• Reducing skewness

Example: Direct Marketing m
In the MLR module we introduced a predictive model for the amount spent (AS) by customers:

AS = β_0 + β_1 × Salary + β_2 × Catalogs + ε



• The relationship between amount spent and salary, holding catalogs constant, was represented in a scatter plot with a regression line.
• The model was then estimated:

Predicted AS = −53.68 + 0.0199 × Salary + 51.695 × Catalogs,   R² = 0.612, adjusted R² = 0.611
               (659.15)   (0.001)            (2.912)
(standard errors in parentheses)

• The next step was to check whether the model fits the data
• The assumptions are laid out
• Multiple graphs are plotted:
o Fitted values against residuals
o Fitted values against squared residuals
o Residual distribution
• The diagnostics (see images in module 7) reveal the following limitations:
o The residuals follow a nonlinear pattern. Hence, Assumptions 1 and 2 cannot
be correct.
o The residuals have non-constant variance. Assumption 3 is not correct.
o The residuals are positively skewed. Skewed errors do not violate any
assumption, but are less ideal.
• The regression can be improved by implementing log transformations (see the sketch below).
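
A sketch of how such a transformed model could be fitted with the statsmodels formula API. The file name and column names (AS, Salary, Catalogs) are assumptions about how the Direct Marketing data might be stored, and which variables to log is a modelling choice guided by the diagnostics.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("direct_marketing.csv")   # hypothetical file with AS, Salary, Catalogs

# Log-transform the response (and, here, the predictors) directly in the formula.
loglog = smf.ols("np.log(AS) ~ np.log(Salary) + np.log(Catalogs)", data=df).fit()
print(loglog.summary())

# The residual diagnostics of this fit can then be compared with the original
# linear specification to see whether the nonlinearity, non-constant variance
# and skewness have been reduced.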


Log transformations
• The interpretation of the slope coefficient is different in each case.
  o Test by plugging in X values and comparing Y.
• Using a Taylor series:

log(x + ∆x) − log(x) ≈ ∆x / x   for small ∆x






Linear-log regression function

E(Y | X = x) = β_0 + β_1 log(x)

Consider a change in x:

E(Y | X = x + ∆x) = β_0 + β_1 log(x + ∆x)

Subtracting the two:

E(Y | X = x + ∆x) − E(Y | X = x) = β_1 [log(x + ∆x) − log(x)] ≈ β_1 ∆x / x

A 1% difference in X (a 0.01 difference in log(X)) is associated with a 0.01 β_1 expected difference in Y.










Log-linear regression function

log(y) = β_0 + β_1 x

Consider a change in x and the corresponding change in y:

log(y + ∆y) = β_0 + β_1 (x + ∆x)

Subtracting the two:

log(y + ∆y) − log(y) = β_1 ∆x  →  ∆y/y ≈ β_1 ∆x

so,

β_1 ≈ (∆y/y) / ∆x   for small ∆x

A unit difference in X is therefore associated with a 100 × β_1 % expected difference in Y.




Log-log regression function

log(y) = β_0 + β_1 log(x)

Consider a change in x:

log(y + ∆y) = β_0 + β_1 log(x + ∆x)

Subtracting the two:

log(y + ∆y) − log(y) = β_1 [log(x + ∆x) − log(x)]  →  ∆y/y ≈ β_1 ∆x/x

therefore,

β_1 ≈ (∆y/y) / (∆x/x)   for small ∆x/x and small ∆y/y

A 1% increase in X is associated with a β_1 % change in Y (β_1 is an elasticity).
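
A small numeric check of the three interpretations, with arbitrary coefficient values (the approximations hold for small changes):

import numpy as np

x, dx = 100.0, 1.0                       # a 1% change in x

# Linear-log: E(Y) changes by roughly beta1 * dx/x = 0.01 * beta1.
b1 = 2.0
print(b1 * (np.log(x + dx) - np.log(x)), 0.01 * b1)

# Log-linear: a one-unit change in x multiplies y by exp(beta1),
# roughly a 100*beta1 % change when beta1 is small.
b1 = 0.05
print(np.exp(b1 * 1.0) - 1, b1)

# Log-log: beta1 is approximately the elasticity of y with respect to x.
b1 = 2.0
print((x + dx) ** b1 / x ** b1 - 1, b1 * dx / x)   # ~2% change in y for a 1% change in x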














Log transformation Summary
• Linear-log (Y on log X): a 1% difference in X is associated with a 0.01 β_1 expected difference in Y.
• Log-linear (log Y on X): a one-unit difference in X is associated with approximately a 100 β_1 % difference in Y.
• Log-log (log Y on log X): a 1% difference in X is associated with approximately a β_1 % difference in Y (β_1 is an elasticity).

Estimating the conditional expectation
Consider the model:

log(Y) = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε

where X_1, …, X_p are arbitrary predictors that may involve transformations.
• This specification means that:

E(log(Y) | X_1 = x_1, …, X_p = x_p) = β_0 + β_1 x_1 + ⋯ + β_p x_p

• However, we are more often interested in knowing:

E(Y | X_1 = x_1, …, X_p = x_p)

Taking the initial model and solving for Y:

Y = exp(β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε)
  = exp(β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p) exp(ε)

The conditional expectation is:

E(Y | X = x) = exp(β_0 + β_1 x_1 + ⋯ + β_p x_p) E(exp(ε) | X = x)

Two points follow:
1. By Jensen's inequality, E(exp(ε)) > exp(E(ε)) = 1.
2. The back-transformation exp(β_0 + β_1 x_1 + ⋯ + β_p x_p) is therefore downward biased as an estimator of E(Y | X = x).

Gaussian MLR
If we assume that

ε ~ N(0, σ²)

then we can show that:

E(exp(ε)) = exp(σ² / 2)

Therefore, in this case:

E(Y | X = x) = exp(β_0 + β_1 x_1 + ⋯ + β_p x_p + σ²/2)

NB: if the errors are not Gaussian, we can use this result as an approximation.

When we estimate the parameters by OLS, we replace the unknown error variance by its usual estimator:

m̂(x) = exp(β̂_0 + β̂_1 x_1 + ⋯ + β̂_p x_p + SER²/2)

General case
Duan (1983) proposed the following estimator, which we can use when the errors are non-Gaussian:

m̂(x) = exp(β̂_0 + β̂_1 x_1 + ⋯ + β̂_p x_p) × (1/n) Σ_i exp(e_i)

Which estimator to use?
• The bias corrected estimators do not necessarily improve accuracy, as they have a
cost in terms of variance.
• The naïve back-transformation may perform better in some cases.
• Compute fitted values or predictions to investigate which approach performs better.
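
A sketch of such a comparison on simulated log-linear data (assumed parameter values), computing in-sample RMSE for the naive back-transformation, the Gaussian correction and Duan's smearing estimator:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, p = 2000, 1
x = rng.normal(size=n)
y = np.exp(1.0 + 0.5 * x + rng.normal(scale=0.8, size=n))   # log(Y) is linear in x

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
xb = fit.fittedvalues                                       # fitted values on the log scale
ser2 = np.sum(fit.resid ** 2) / (n - p - 1)                 # SER^2

estimates = {
    "naive": np.exp(xb),                                    # biased downward for E(Y|X)
    "gaussian": np.exp(xb + ser2 / 2),                      # Gaussian correction
    "duan": np.exp(xb) * np.mean(np.exp(fit.resid)),        # Duan (1983) smearing
}
for name, m in estimates.items():
    print(name, np.sqrt(np.mean((y - m) ** 2)))             # in-sample RMSE, lower is better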

Example: Direct Marketing
Comparison of the regression models in terms of in-sample fit in the original scale.






• The log-log specification is the best (in-sample) fit for this data. The bias-corrected estimate gives a slight improvement.
• RMSE is the root mean-square error (the lower, the better):

RMSE = √( (1/n) Σ_i (y_i − ŷ_i)² )

Power transformations
In the example above, using the log transformation made the response negatively skewed.

• Power transformations are a more general type of transformation that allows you to obtain approximately symmetric, or even approximately normal, distributions.
• Some common power transformations are:
  o Square root: √Y
  o Inverse: 1/Y
  o Square: Y²
  o Box-Cox (defined below)

Box-Cox transformation
The Box-Cox transformation is a one-parameter family of transformations defined as:

y^(λ) = (y^λ − 1) / λ   if λ ≠ 0
y^(λ) = log(y)          if λ = 0

• Lower values of 𝜆 reduce positive skewness.
• Higher values of 𝜆 reduce negative skewness.

Box-Cox regression

Y^(λ) = β_0 + β_1 X_1 + β_2 X_2 + ⋯ + β_p X_p + ε

• The disadvantage of the Box-Cox regression is that the coefficients do not have convenient interpretations like those in the log regression.
• This highlights another possible trade-off in statistical modelling: accuracy vs interpretability.
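
A sketch of a Box-Cox workflow with scipy and statsmodels, on simulated positively skewed data (the estimated λ and the coefficients are illustrative only):

import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
n = 1000
x = rng.uniform(1, 10, size=n)
y = np.exp(0.2 + 0.3 * x + rng.normal(scale=0.3, size=n))   # positive, right-skewed response

y_bc, lam = stats.boxcox(y)            # maximum-likelihood estimate of lambda
fit = sm.OLS(y_bc, sm.add_constant(x)).fit()

print(lam)          # a low lambda here reflects the strong positive skewness
print(fit.params)   # coefficients are on the transformed scale, so harder to interpret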









t
en


m
cu

Do


wa


ks


in


Th








