Semester 1, 2017
NB: When a new response is outside the estimation sample in a regression, it is
independent of the sample error.
manipulated in principle, or to data that are collected over time.
• Randomization and good sampling design are desirable in social research, but they are not prerequisites for drawing statistical inferences. Even when randomization or random sampling is employed, we typically want to generalize beyond the strict bounds of statistical inference.
Chapter 2 Summary
• In very large samples, and when the explanatory variables are discrete, it is possible
• Two random variables Y and X have a relationship when they are not independent,
that is:
P(Y, X) ≠ P(Y) · P(X)
• I.e. two variables have a relationship if changes in one variable are associated with
changes in the other variable.
Using data to measure statistical relationships
Businesses, financial markets, market research firms etc. collect data to consider questions
such as:
• How can I protect my customers from credit card fraud?
• What is the effect of targeted ads on purchase rates?
• What is the effect of a drug on patients?
Observational vs experimental data
• Experimental data: the values of an explanatory variable of interest are assigned by a random mechanism, independently of other factors (randomized experiments).
• Observational data: the values of the explanatory variables and responses are observed by the researcher without intervention.
1. X and Y must co-vary or have a relationship
2. The independent variable X must precede the dependent variable Y (in time)
3. No other factor could possibly have caused the measured change in Y
Because the circumstances of observational studies are usually not controlled by the
researcher, they generally do not allow for isolating the effect of X on Y among possible
omitted variables.
Theoretical relationships
Theory or domain expertise suggests important relationships amongst variables:
• Asset return = α + β × market return
• Company return = f(earnings)
• Sales = f(advertising)
• Bounce rate = f(website layout, content)
• Probability of purchase = f(gender, age, income, targeted marketing, past purchases, …)
From theoretical to empirical relationships
Theory or expert knowledge:
• Often does not specify functional forms.
• Rarely suggests quantitative magnitudes.
• May require empirical testing and/or verification.
Regression Analysis
• Skewness: if the conditional distribution of Y is skewed (as at x1), then the mean will
not be a good summary of its center.
• Multiple modes: if the conditional distribution of Y is multimodal (as at x2), then it is intrinsically unreasonable to summarize its center by a single number.
• Heavy tails: if the conditional distribution of Y is non-normal – for example heavy-tailed – (as at x3), then the sample mean will not be an efficient estimate of the center of the Y-distribution, even if it is symmetric.
• Unequal spread: if the conditional variance of Y changes with the values of the X's (compare x4 and x5), then the efficiency of the usual least-squares estimates may be compromised.
Simple Linear Regression (SLR)
Let (Yᵢ, Xᵢ) for i = 1, …, n be the sample of interest. The SLR model is:

Yᵢ = β₀ + β₁Xᵢ + εᵢ

response = intercept + slope × predictor + error

Xᵢ is the independent variable, regressor, predictor, or explanatory variable
Yᵢ is the dependent variable or response
β₀ is the population intercept
β₁ is the population slope
εᵢ is the regression error (omitted factors, measurement error)
Least Squares Estimation
Question: How can we estimate β₀ and β₁ from the data?
The sample average Ȳ is the least squares estimator of μ_Y; that is, Ȳ is the solution to:

min over m of: Σᵢ₌₁ⁿ (Yᵢ − m)²

The OLS estimator of β₀ and β₁ is:

(β̂₀, β̂₁) = argmin over b₀, b₁ of: Σᵢ₌₁ⁿ (Yᵢ − b₀ − b₁Xᵢ)²
Other algorithms
1. Least absolute deviations (minimize the sum of the absolute values of the errors):

min over b₀, b₁ of: Σᵢ₌₁ⁿ |Yᵢ − b₀ − b₁Xᵢ|

2. Local smoothing: for each X value, use the average value of Y for the data “close” to it (this will not give a straight line).
3. Minimize the sum of the absolute errors to the pth power:

min over b₀, b₁ of: Σᵢ₌₁ⁿ |Yᵢ − b₀ − b₁Xᵢ|ᵖ
Analytical solution

min over b₀, b₁ of: Σᵢ₌₁ⁿ (Yᵢ − b₀ − b₁Xᵢ)²

We can obtain the exact solutions to this problem using calculus:

β̂₀ = Ȳ − β̂₁X̄

β̂₁ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = S_XY / s_X²

where:
S_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1) is the sample covariance between Y and X; and
s_X² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) is the sample variance of X
Notation and terminology
Ŷᵢ = β̂₀ + β̂₁Xᵢ : Fitted value
eᵢ = Yᵢ − β̂₀ − β̂₁Xᵢ = Yᵢ − Ŷᵢ : Residual
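As a quick illustration (not part of the original notes), the closed-form OLS solution and the fitted-value/residual definitions above can be checked numerically; the data and coefficients below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(5.0, 2.0, n)            # simulated regressor
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)   # true beta0 = 2, beta1 = 0.5

# Analytical OLS solution: beta1_hat = S_XY / s_X^2, beta0_hat = ybar - beta1_hat * xbar
sxy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance
sx2 = np.sum((x - x.mean()) ** 2) / (n - 1)               # sample variance of X
beta1_hat = sxy / sx2
beta0_hat = y.mean() - beta1_hat * x.mean()

fitted = beta0_hat + beta1_hat * x     # fitted values Y-hat
resid = y - fitted                     # residuals e_i

# With an intercept, OLS residuals sum to (numerically) zero
print(beta0_hat, beta1_hat, resid.sum())
```

The same estimates can be obtained from `np.polyfit(x, y, 1)`, which solves the identical least squares problem.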
Measuring fit
Question: How well does the regression fit the data?
• The standard error of the regression (SER) measures the standard deviation of the
regression errors.
• The regression R² measures the fraction of the variance in Y “explained” by the variation in X and the linear model.
  o R² ranges from 0 to 1
Standard error of the regression
SER is in the same units as Y:

SER = √( (1/(n − 2)) Σᵢ₌₁ⁿ (eᵢ − ē)² ) = √( (1/(n − 2)) Σᵢ₌₁ⁿ eᵢ² )

NB: proving that ē = 0 follows from the equation for β̂₀
Question: Why divide by n – 2 instead of n – 1?
• The division by n – 2 is a degrees of freedom correction
  o This is just like the division by n – 1 in S_Y², but here we have estimated two parameters
• The SER is unbiased only if we use n – 2 (proof later)
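A short sketch of the SER calculation on simulated data (the design and the true error standard deviation of 0.5 are assumptions for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)   # true error sd = 0.5

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                  # residuals; they sum to ~0 by construction

ser = np.sqrt(np.sum(e ** 2) / (n - 2))   # divide by n - 2: two estimated parameters
print(ser)                             # should be near the true error sd, 0.5
```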
Analysis of variance decomposition
The variance of an OLS estimator is:

Var(θ̂) = E[ (θ̂ − E[θ̂])² ]
Coefficient of determination
R² = (TSS − RSS) / TSS = 1 − RSS / TSS

or

R² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² / Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²
• R² measures the proportion of the variation in the response Y that is accounted for by the estimated linear regression line (with X)
  o NB: it is not a measure of how good the model is, or similarly a measure of how well the model predicts the response
Relationship between R² and SLR
R² corresponds to the square of the sample correlation coefficient between the response and the predictor.
• NB: R² is symmetric, and thus it does not matter which variable is the response or the predictor.
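The claim that R² equals the squared sample correlation can be verified numerically; the simulated data below are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 3.0 - 1.5 * x + rng.normal(size=150)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
rss = np.sum((y - yhat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
r2 = 1.0 - rss / tss

# R^2 equals the squared sample correlation, which is symmetric in X and Y
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)
```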
Application: Capital Asset Pricing Model (CAPM)
• CAPM was proposed by William Sharpe in 1964
• It implies an SLR model
Main idea
There are two types of risk:
• Systematic (non-diversifiable)
• Idiosyncratic (asset specific)
The expected return of an asset should only depend on the sensitivity of the asset to the
market portfolio (which represents total diversification)
Notation
R_t: asset return at time t
R_f,t: risk-free rate of return at time t
R_m,t: market return at time t
Y_t = R_t − R_f,t: asset return premium at time t
X_t = R_m,t − R_f,t: market return premium at time t
Regression
The population regression line:

E[R_t] = R_f,t + β (E[R_m,t] − R_f,t)

The empirical SLR model:

R_t − R_f,t = α + β (R_m,t − R_f,t) + ε_t

• NB: α = 0 according to the theory (but not necessarily empirically)
Financial Returns
Let P_t and P_{t−1} be the price of an asset at times t and t − 1 respectively. The return of the asset at time t is:

R_t = (P_t − P_{t−1}) / P_{t−1} = P_t / P_{t−1} − 1

• NB: multiply by 100 to obtain percentage returns.
• NB: it is standard to base returns on closing prices, adjusted for dividend payments.
Financial analysis often relies on log returns instead, which are mathematically more convenient:

R_t = log(P_t) − log(P_{t−1})

• where log is the natural logarithm
• NB: it is a property of natural logarithms that log(1 + r) ≈ r for small values of r
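A minimal numeric check of the two return definitions, using made-up prices; it also shows how close log returns are to simple returns for small moves:

```python
import math

# Simple vs log returns for illustrative prices (numbers are made up)
p_prev, p_now = 100.0, 101.5
simple = p_now / p_prev - 1.0                    # (P_t - P_{t-1}) / P_{t-1}
log_ret = math.log(p_now) - math.log(p_prev)     # log return

# log(1 + r) is approximately r for small r, so the two are close for small moves
print(simple, log_ret)
```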
Questions
The key prediction of the CAPM is that differences in expected returns should be related to beta: the larger the beta, the higher the expected return. This does not always hold in real data. Questions to consider:
• Is the model a strong or weak fit to the data?
o NB: looking at R² is incorrect
• Is the market beta significantly different from 1?
• Is Jensen’s alpha significantly different from 0?
• Inferential arguments in support of these methods, and for uncertainty quantification.
  o NB: algorithms are what statisticians do, while inference says why they do them and what the associated uncertainty is.
Question: How accurate is the OLS algorithm for SLR?
• Under what conditions does OLS give appropriate estimates of β₀ and β₁?
• Is it unbiased, consistent and/or efficient?
• What is the sampling distribution of the estimator?
To answer these questions some assumptions have to be established
Review of basic concepts
Let θ̂ be an estimator for a fixed but unknown population parameter θ.
• Unbiasedness: E[θ̂] = θ, where the expectation is over a random sample
• Consistency: θ̂ →ᵖ θ. The estimator gets closer to the true parameter in probability as the sample size gets larger
  o Informally, the estimator gets it right with an arbitrarily large sample size
• Efficiency: an efficient estimator has variance equal to or lower than the variance of all other possible estimators.
SLR model assumptions
1. Linearity: if X = x then Y = β₀ + β₁x + ε for some population parameters β₀ and β₁ and a random error ε.
2. Exogeneity: the conditional mean of ε given X is zero, that is E[ε|X] = 0. Hence E[Y|X = x] = β₀ + β₁x.
3. Constant error variance: Var(ε|X = x) = σ².
4. Independence: all the error pairs εᵢ and εⱼ (i ≠ j) are independent.
5. The distribution of X is arbitrary (X can even be non-random).

NB: these assumptions are all non-trivial and will often be violated (perhaps all at the same time).
Assumptions 1 and 2
Assumptions 1 and 2 mean that the expectation of Y given X is a straight line:

E[Y|X] = E[β₀ + β₁X + ε | X] = β₀ + β₁X + E[ε|X] = β₀ + β₁X

NB: they imply that all other factors (i) have an average effect of zero on Y given X and (ii) are uncorrelated with X.
Assumption 2
• In experimental data E[ε|X] = 0 holds by design
• In observational data E[ε|X] = 0 will generally not hold
Assumption 3
• The assumption that Var 𝜀 𝑋 = 𝑥 = 𝜎 Z plays a central role in the derivation of the
sampling distribution of the OLS estimator.
• The constant error variance case is also known as homoscedasticity (especially in
econometrics).
o Heteroscedasticity refers to the case where the variance is non-constant
  § It is possible to correct the OLS standard errors for non-constant variance (heteroscedasticity-robust standard errors).
  § We can still use OLS when the assumption is not satisfied. However, more efficient (lower variance) estimators are available.
• Data transformation is often helpful for obtaining a specification that
(approximately) satisfies this assumption.
Assumption 4
This arises automatically if the data are collected by simple random sampling. The assumption implies that the errors are uncorrelated:

Cov(εᵢ, εⱼ | Xᵢ, Xⱼ) = 0, i ≠ j

• We may encounter non-i.i.d. (independent and identically distributed) sampling where there is non-random sampling
  o Example: convenience sampling
• Time series and spatial data will usually violate the assumption that the observations are independent.
• Even if these strong assumptions are satisfied, this does not mean that we're good in practice.
• The OLS estimator can be very sensitive to the structure of the data and may be
Establishing the sampling properties of β̂₀ and β̂₁ will allow us to:
• Verify unbiasedness and consistency
• Quantify the sampling uncertainty associated with β̂₀ and β̂₁
• Use β̂₀ and β̂₁ to test hypotheses (such as β₁ = 0)
• Construct a confidence interval for β₁ and sometimes β₀
NB: assume that Assumptions 1–5 hold throughout (except when noted)
Preliminaries
Recall the following probability rules:

E[aX + bY] = aE[X] + bE[Y]
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X, Y)

NB: these are very important and need to be memorized.
Suppose for example that εᵢ and εⱼ (i ≠ j) are random errors from the SLR model:

E[aεᵢ + bεⱼ] = aE[εᵢ] + bE[εⱼ] = 0

Key technique: if we can write an estimator (say, the OLS estimator) as a linear function of a constant plus a sum of independent error terms (constant plus noise representation), then we can easily calculate the expected value and variance using the rules.
Let X and Y be two arbitrary random variables. The law of iterated expectations states that:

E[Y] = E[E[Y|X]]

For our model, Assumption 2 and the law of iterated expectations imply for example:

E[εᵢ] = E[E[εᵢ|Xᵢ]] = E[0] = 0
E[Xᵢεᵢ] = E[E[Xᵢεᵢ|Xᵢ]] = E[Xᵢ E[εᵢ|Xᵢ]] = 0
Constant plus noise representation
We need to establish some notation and auxiliary results.
Define:

cᵢ = (Xᵢ − X̄) / Σⱼ (Xⱼ − X̄)²

Then:

Σᵢ (Xᵢ − X̄) = 0 ⟹ Σᵢ cᵢ = 0
Σᵢ (Xᵢ − X̄)Xᵢ = Σᵢ (Xᵢ − X̄)² ⟹ Σᵢ cᵢXᵢ = 1

From the formula for β̂₁,

β̂₁ = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ (Xᵢ − X̄)² = Σᵢ cᵢ(Yᵢ − Ȳ)

Notice that

β̂₁ = Σᵢ cᵢ(Yᵢ − Ȳ)
   = Σᵢ cᵢYᵢ − Σᵢ cᵢȲ
   = Σᵢ cᵢYᵢ − Ȳ Σᵢ cᵢ
   = Σᵢ cᵢYᵢ

Interpretation: β̂₁ is a linear combination of the observations Yᵢ. That's the “linear” in linear estimator.

β̂₁ = Σᵢ cᵢYᵢ
   = Σᵢ cᵢ(β₀ + β₁Xᵢ + εᵢ)
   = β₀ Σᵢ cᵢ + β₁ Σᵢ cᵢXᵢ + Σᵢ cᵢεᵢ
   = β₁ + Σᵢ cᵢεᵢ

Therefore:

β̂₁ = β₁ + Σᵢ cᵢεᵢ

Interpretation: β̂₁ is the true parameter plus a linear combination of the errors (if the model is correct). That's the constant plus noise representation.
Expected value and unbiasedness
From

β̂₁ = β₁ + Σᵢ cᵢεᵢ

it follows that:

E[β̂₁] = E[β₁ + Σᵢ cᵢεᵢ]
      = β₁ + E[Σᵢ cᵢεᵢ]
      = β₁ + Σᵢ E[cᵢεᵢ]
      = β₁ + Σᵢ E[E[cᵢεᵢ | X₁, …, Xₙ]]
      = β₁ + Σᵢ E[cᵢ E[εᵢ|Xᵢ]]
      = β₁

using the linearity of expectations, the law of iterated expectations, and Assumption 2.
Variance

Var(β̂₁) = Var(β₁ + Σᵢ cᵢεᵢ)
        = Var(Σᵢ cᵢεᵢ)
        = Σᵢ cᵢ² Var(εᵢ)     (using Assumption 4)
        = Σᵢ cᵢ² σ²           (using Assumption 3)
        = σ² Σᵢ cᵢ²
        = σ² Σᵢ (Xᵢ − X̄)² / ( Σᵢ (Xᵢ − X̄)² )²
        = σ² / Σᵢ (Xᵢ − X̄)²
        = σ² / ((n − 1) s_X²)

Therefore:

Var(β̂₁) = σ² / ((n − 1) s_X²)

Interpretation:
• The larger the noise around the population regression line, the larger the variance of β̂₁.
• The variance decreases with the sample size n, and is inversely proportional to it.
• The larger the variation in the regressor, the lower the variance of β̂₁. More variation in X makes the slope easier to estimate.
Consistency
E[β̂₁] = β₁
Var(β̂₁) = σ² / ((n − 1) s_X²)
• The variance of the estimator goes to zero when n → ∞.
• Together with unbiasedness, this means the OLS estimator will approach β₁ in probability as n gets larger.
• The estimator is consistent.
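A Monte Carlo sketch (simulated data, illustrative design) of the unbiasedness and variance results above: across repeated samples, the mean of the OLS slope should match β₁ and its variance should match σ²/((n − 1)s_X²):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, n_sims = 50, 1.0, 20000
x = rng.uniform(0.0, 10.0, n)          # regressor, held fixed across samples
xc = x - x.mean()
sxx = np.sum(xc ** 2)                  # equals (n - 1) * s_x^2

# Draw n_sims samples of y at once and compute the OLS slope for each
eps = rng.normal(0.0, sigma, (n_sims, n))
y = 1.0 + 0.3 * x + eps                # true beta1 = 0.3 (made up)
slopes = (y @ xc) / sxx                # b1_hat = sum((x - xbar) * y) / sxx

theoretical_var = sigma ** 2 / sxx
print(slopes.mean(), slopes.var(), theoretical_var)
```

The simulated mean should sit close to 0.3 and the simulated variance close to the theoretical value, illustrating unbiasedness and the variance formula.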
Large sample distribution using the CLT
SLR Model
Starting from the constant plus noise representation:

β̂₁ = β₁ + Σᵢ cᵢεᵢ

• It follows that the finite sample distribution of β̂₁ depends on the population distribution of the errors.
General Setting
Let X₁, …, Xₙ be a sequence of independent and identically distributed (i.i.d.) random variables such that E[Xᵢ] = μ and Var(Xᵢ) = σ² for every i. Furthermore, let X̄ be the sample average estimator. The CLT states that when n → ∞,

X̄ →ᵈ N(μ, σ²/n),   or   (X̄ − μ)/(σ/√n) →ᵈ N(0, 1)

• Hence, for a sufficiently large n we can use this result as an approximation and say (X̄ − μ)/(σ/√n) ≈ N(0, 1).
• Limitation: n may have to be huge for this approximation to be reasonable when the distribution of X is very skewed and/or subject to large outliers.
SLR Model

β̂₁ = β₁ + Σᵢ cᵢεᵢ

means that β̂₁ is essentially a weighted average of the errors (recall that the variance of β̂₁ is proportional to 1/n). Hence a CLT applies to it.
For n sufficiently large,

β̂₁ ≈ N(β₁, Var(β̂₁))
(β̂₁ − β₁) / √Var(β̂₁) ≈ N(0, 1)
Summary
If the SLR model assumptions hold, then:
• β̂₁ is unbiased.
• The variance of β̂₁ is proportional to 1/n.
• β̂₁ is consistent.
Standard error
We showed that:

Var(β̂₁) = σ² / Σᵢ (Xᵢ − X̄)²

• In practice we do not know σ²
  o We need to estimate it.
• Recall that the standard error of the regression (SER) estimates the standard deviation of the errors:

SER = √( Σᵢ eᵢ² / (n − 2) )
The standard error of β̂₁ is therefore:

SE(β̂₁) = SER / √( Σᵢ (Xᵢ − X̄)² )
Robust standard error
• Assumption 3 of constant error variance was crucial for deriving Var(β̂₁), but is often unrealistic in practice.
• Heteroscedasticity-robust standard errors allow us to account for this possibility without the added layer of complexity of modelling the conditional variance Var(εᵢ|Xᵢ = xᵢ) or considering another estimator for β₁.
If we cannot assume constant variance, then:

Var(β̂₁) = Σᵢ cᵢ² Var(εᵢ|Xᵢ) = Σᵢ cᵢ² σᵢ²

We can estimate Σᵢ cᵢ² σᵢ² by Σᵢ cᵢ² eᵢ² to obtain the heteroscedasticity-robust standard error:

SE(β̂₁) = √( Σᵢ (Xᵢ − X̄)² eᵢ² / ((n − 1)² s_X⁴) )
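A sketch comparing the classical and heteroscedasticity-robust standard errors on simulated data where the error variance grows with x (the design and all numbers are assumptions for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1.0, 5.0, n)
y = 2.0 + 1.0 * x + rng.normal(0.0, 1.0, n) * 0.2 * x ** 2   # error sd grows with x

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

xc = x - x.mean()
sxx = np.sum(xc ** 2)
se_classic = np.sqrt(np.sum(e ** 2) / (n - 2)) / np.sqrt(sxx)   # assumes constant variance
se_robust = np.sqrt(np.sum(xc ** 2 * e ** 2)) / sxx             # White-type robust SE

print(se_classic, se_robust)   # robust SE is larger here, as the errors fan out with x
```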
Robust vs non-robust standard errors
• Robust standard errors are more conservative (i.e. larger) compared to the regular standard errors.
• The main advantage of assuming constant variance is that the resulting tests are more efficient (higher power), if the assumption of constant variance is correct.
• However, if the errors have non-constant variance, computing SEs under Assumption 3 gives incorrect standard errors.

(β̂₁ − β₁) / SE(β̂₁) ≈ N(0, 1)
We may also refer to the following approximation:

(β̂₁ − β₁) / SE(β̂₁) ≈ tₙ₋₂

• tₙ₋₂ denotes the t distribution with n − 2 degrees of freedom.
• The t distribution is the exact distribution when the errors are Gaussian with constant variance (see Module 4).
• The t distribution becomes identical to the normal distribution as the degrees of freedom increase, such that the distinction is immaterial for n large enough for an acceptable CLT approximation.
Hypothesis testing
Objective: test a hypothesis about β₁ using observational data and reach a tentative conclusion about whether a linear relationship exists, or whether it is positive or negative.
Two-sided:
H₀: β₁ = 0
H₁: β₁ ≠ 0
One-sided:
H₀: β₁ = 0
H₁: β₁ > 0
In general, test statistics have the form:

t = (estimator − value under H₀) / (standard error of estimator) ≈ t_df

For testing a sample mean:

t = (Ȳ − μ₀) / (S_Y/√n)

For testing a regression slope:

t = (β̂₁ − b) / SE(β̂₁)
Testing the magnitude of β₁
• Large sample test:

(β̂₁ − β₁) / SE(β̂₁) ≈ tₙ₋₂

a) Construct the test statistic:

t = (β̂₁ − b) / SE(β̂₁)

c) The p-value is (the probability in the tails of the distribution outside |t|):

2 × P(tₙ₋₂ > |t|)
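The test recipe above can be sketched as follows on simulated data, using scipy for the tₙ₋₂ tail probability (the design and the true slope of 0.4 are assumptions for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 80
x = rng.normal(0.0, 1.0, n)
y = 0.5 + 0.4 * x + rng.normal(0.0, 1.0, n)   # true slope 0.4 (made up)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(e ** 2) / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = (b1 - 0.0) / se_b1                      # a) test statistic, value under H0 is 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # c) two-sided p-value
print(t_stat, p_value)
```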
Summary
Interpretation
• Correct interpretation of a p-value: it is the probability under the null hypothesis, of
getting a result at least as extreme as what we observed.
• A significant parameter means that we can reliably detect that it is not exactly zero.
• A non-significant parameter can mean that either:
a) The parameter is indeed exactly zero (which likely makes no difference to our
lives)
b) We cannot measure the effect sufficiently accurately (the standard error is
large) to reliably say that it is not exactly zero.
• NB: if the parameter is not significant, provide and discuss the confidence interval.
Confidence intervals
Confidence intervals always have the form:

point estimate ± critical value × standard error

Using:

(β̂₁ − β₁) / SE(β̂₁) ≈ tₙ₋₂

we have that an approximate 100×(1 − α)% confidence interval for β₁ in large samples is:

β̂₁ ± t_{n−2, α/2} × SE(β̂₁)
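A sketch of the confidence interval formula on simulated data (the 95% level and all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 120
x = rng.normal(10.0, 3.0, n)
y = 5.0 + 1.2 * x + rng.normal(0.0, 2.0, n)   # true slope 1.2 (made up)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(e ** 2) / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)   # critical value t_{n-2, alpha/2}
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(ci)
```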
Interpretation
• Large sample CIs and test statistics
Prediction

• Suppose we estimate the regression line β̂₀ + β̂₁x.
• We then observe a new case from the population where X = x₀, without knowing Y₀. Our prediction for Y₀ is β̂₀ + β̂₁x₀.

m(x) ≡ E[Y|X = x] = β₀ + β₁x
m̂(x) = β̂₀ + β̂₁x
Prediction error
We define the prediction error as:

η₀ = Y₀ − m̂(x₀)

Under Assumption 1,

η₀ = m(x₀) − m̂(x₀) + ε₀
Interpretation:
The prediction error has two components
1. The estimation error for the regression line.
2. The unavoidable noise 𝜀S .
Theory and practice
• We need to make several assumptions about the model; however, we stress that these assumptions are often unrealistic.
• Regression models can be very useful approximations to the truth
• Predictive modelling has a clear objective, which is to generate the best possible
predictions. We therefore propose models, measure and compare their performance
for prediction, and try to improve.
• Assumptions allow us to say more about a problem. In general, it is important to
understand in which situations violations cause serious problems.
In the SLR model we can drop Assumptions 1 and 2 (which essentially mean that the model
is correct) to say the following:
• Result: when 𝑛 → ∞, the least squares method will lead to the optimal linear
prediction of Y given X = x (in terms of mean-square error)
Sampling properties of the conditional estimator
Expected value and unbiasedness
From here on we are back to Assumptions 1–5.
Since β̂₀ and β̂₁ are unbiased, the estimator of the conditional expectation is unbiased:

E[m̂(x)] = E[β̂₀] + E[β̂₁]x = β₀ + β₁x
Variance

Var(m̂(x)) = σ² ( 1/n + (x − X̄)² / ((n − 1) s_X²) )

Interpretation:
• It grows with σ². The more noise there is in the data, the harder it is to estimate the regression line.
• It is proportional to 1/n: the larger the sample size, the more accurate the regression line.
• It has two components:
  1. σ²/n corresponds to the sampling uncertainty in estimating the mean of Y.
  2. The second term decreases with s_X², since a larger s_X² makes it easier to estimate the slope.
• It increases with (x − X̄)². The further we are from the center of the data, the larger the sampling uncertainty.
Large sample approximation for the sampling distribution
For n sufficiently large, we can apply the CLT to approximate the distribution of m̂(x) as:

m̂(x) ≈ N( β₀ + β₁x, σ² (1/n + (x − X̄)²/((n − 1) s_X²)) )
Confidence interval for the conditional expectation
For n sufficiently large,

m̂(x) ≈ N( β₀ + β₁x, σ² (1/n + (x − X̄)²/((n − 1) s_X²)) )

The approximate 100×(1 − α)% confidence interval for the regression line is therefore:

m̂(x) ± z_{α/2} √Var(m̂(x))
Sampling properties of the SLR prediction error
Prediction error:

η₀ = Y₀ − m̂(x₀) = m(x₀) − m̂(x₀) + ε₀

The expected value and variance of the prediction error are:

E[η₀] = E[m(x₀) − m̂(x₀)] + E[ε₀] = 0
Var(η₀) = Var(m̂(x₀)) + Var(ε₀)
        = σ² ( 1 + 1/n + (x₀ − X̄)²/((n − 1) s_X²) )

Variance

Var(η₀) = σ² ( 1 + 1/n + (x₀ − X̄)²/((n − 1) s_X²) )
Interpretation:
• The first component is σ², associated with ε₀. We call this the irreducible error since it does not decrease with the size of the estimation sample.
• The second is the variance of m̂(x₀), which has been derived above.
Prediction interval
A 100×(1 − α)% prediction interval (Y₀,L, Y₀,U) has the property that:

P(Y₀,L < Y₀ < Y₀,U | X₀ = x₀) = 1 − α

• The probability is over the random experiment of drawing a random sample of size n from the population (for given values of the regressor), estimating β₀ and β₁, computing Y₀,L and Y₀,U based on β̂₀, β̂₁ and x₀, and drawing Y₀|X₀ = x₀.
  o NB: This is a hard concept to grasp, as it is a “mix” of computing a confidence interval for m(x₀) and computing a probability over the distribution of ε₀.
η₀ = m(x₀) − m̂(x₀) + ε₀
• Since the prediction error depends on ε₀, we can only build a prediction interval if we explicitly estimate the distribution of the errors.
  o One option is to use a computational method called bootstrapping (beyond our scope).
  o Another option is to assume that ε ~ N(0, σ²), in which case we can conduct exact inference.
Technical details
The Gaussian SLR Model
• So far the analysis has not made any distributional assumptions about the regression errors.
A key feature of the Gaussian SLR Model is that we know the full form of the conditional distribution of Y:

Yᵢ|Xᵢ = xᵢ ~ N(β₀ + β₁xᵢ, σ²)
Maximum likelihood estimation
Maximum likelihood (ML) estimation is available when we are able and want to specify a
full probabilistic model for the population.
• This is a new concept which we now introduce for the specific case of the Gaussian
SLR model.
• Intuitively, ML estimation chooses the values of the parameters that maximize the
probability of the observed data under the model.
Yᵢ|Xᵢ = xᵢ ~ N(β₀ + β₁xᵢ, σ²)

The probability density function (PDF) evaluated at an observed value yᵢ is:

p(yᵢ|xᵢ; β₀, β₁, σ²) = (1/√(2πσ²)) exp( −(yᵢ − β₀ − β₁xᵢ)² / (2σ²) )

This is the formula for the normal PDF from basic statistics. The difference is that in basic statistics it was formulated in terms of a population mean μ, while the SLR model specifies a particular mean, which is β₀ + β₁xᵢ.
Review
Recall that two events A and B are independent if and only if their joint probability is the product of their individual probabilities:

P(A and B) = P(A) · P(B)

For two independent continuous random variables U and V, their joint PDF evaluated at U = u and V = v is the product of their individual PDFs:

p(u, v) = p(u)p(v)
Likelihood
The likelihood function is the joint PDF of the response evaluated at the observed sample
values. In our Gaussian SLR model, Assumption 4 (independence) implies that we can
multiply the PDFs for each observation:
∏ᵢ₌₁ⁿ p(yᵢ|xᵢ; β₀, β₁, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp( −(yᵢ − β₀ − β₁xᵢ)² / (2σ²) )
Log-likelihood
In practice, it is more convenient to work with the natural logarithm of the likelihood, the
log-likelihood:
L(β₀, β₁, σ²) = log ∏ᵢ₌₁ⁿ p(yᵢ|xᵢ; β₀, β₁, σ²)
             = Σᵢ₌₁ⁿ log p(yᵢ|xᵢ; β₀, β₁, σ²)
             = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ)²
The likelihood and the log-likelihood have the same maximum points.
Maximum Likelihood estimation
We then maximize the log-likelihood as a function of the parameters:

max over β₀, β₁, σ² of: L(β₀, β₁, σ²)

L(β₀, β₁, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − β₀ − β₁xᵢ)²

• The last term corresponds to the residual sum of squares (RSS) times a negative multiplier.
  o Therefore, it turns out that ML estimation of β₀ and β₁ is the same as OLS for this model.
• We maximize for σ² separately by replacing the OLS residual sum of squares into the formula (since we already got β̂₀ and β̂₁ without needing to worry about σ²). This leads to:

σ̂² = (1/n) Σᵢ eᵢ²
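The claim that ML and OLS coincide for β₀ and β₁ in the Gaussian SLR model, with σ̂² = RSS/n, can be checked by maximizing the log-likelihood numerically (simulated data; scipy's generic optimizer stands in for the closed-form solution):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 200
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, n)   # made-up true parameters

def neg_loglik(params):
    # Negative Gaussian SLR log-likelihood; sigma^2 is parameterized on the log scale
    b0, b1, log_s2 = params
    s2 = np.exp(log_s2)
    resid = y - b0 - b1 * x
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(s2) + np.sum(resid ** 2) / (2 * s2)

res = minimize(neg_loglik, x0=np.zeros(3), method="Nelder-Mead",
               options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 5000})
b0_ml, b1_ml, s2_ml = res.x[0], res.x[1], np.exp(res.x[2])

# OLS for comparison
b1_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_ols = y.mean() - b1_ols * x.mean()
e = y - (b0_ols + b1_ols * x)
print(b0_ml - b0_ols, b1_ml - b1_ols, s2_ml - np.sum(e ** 2) / n)   # differences near zero
```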
Properties
Beyond our specific example ML estimation has the following properties under general
conditions:
• Consistency
• Unbiasedness when 𝑛 → ∞ (asymptotic unbiasedness)
• Efficiency when 𝑛 → ∞. No other estimator that is unbiased when 𝑛 → ∞ has
smaller variance when 𝑛 → ∞.
• The sampling distribution of the ML estimator becomes Gaussian when 𝑛 → ∞.
• Accurate finite sample distribution approximations (if you trust the model, and with
enough computational power).
• Limitation: ML can give bad or pathological estimates in certain cases.
Discussion
We ended up back at OLS, so what is the point?
• We added a new algorithm to the toolbox and started by understanding it in a simple case.
• Learned that if the errors are Gaussian or approximately Gaussian, we can be
confident that ML/OLS will work very well for SLR.
• ML is broadly applicable. It will be necessary for variations and generalizations of our
regression framework. OLS will no longer be an option or make sense in those cases.
• We need the concept of a log-likelihood for model selection
• ML and OLS give different estimates when the distribution of the errors is not
Gaussian.
Statistical inference
Fundamental property of Gaussian distributions
Let X and Y be two (jointly) Gaussian random variables and a, b and c be non-random scalars. Now, define the random variable:

W = aX + bY + c

It follows that:

W ~ N(E[W], Var(W))
where we previously reviewed the formulas for E[W] and Var(W).
Interpretation: a linear combination of Gaussian random variables is itself a Gaussian random variable.
In Module 2 it was established that:

β̂₁ = β₁ + Σᵢ cᵢεᵢ = β₁ + Σᵢ [ (Xᵢ − X̄) / Σⱼ (Xⱼ − X̄)² ] εᵢ

Since εᵢ is Gaussian for all i, it follows that β̂₁ is a linear combination of Gaussian random variables.
Therefore, for the Gaussian model:

β̂₁ ~ N( β₁, σ² / ((n − 1) s_X²) )

where the mean and variance follow from earlier results, which hold regardless of the distribution.
t-statistic
When estimating σ² unbiasedly, we have:

(β̂₁ − β₁) / SE(β̂₁) ~ tₙ₋₂
where tn-2 denotes the t distribution with n – 2 degrees of freedom. This is an exact
sampling distribution.
Sampling distribution of the error variance estimator
In Gaussian SLR, we can show that:

nσ̂² / σ² ~ χ²ₙ₋₂

where χ²ₙ₋₂ denotes the chi-square distribution with n − 2 degrees of freedom.
Review
• Let Z₁, Z₂, …, Zₙ be n independent standard normal random variables. Then Σᵢ₌₁ⁿ Zᵢ² follows the χ²ₙ distribution.
Discussion
• In the Gaussian SLR model, we have exact sampling distributions and test statistics
Prediction interval
In the last module we investigated the case where we predict a new response Y₀ for a given value x₀, based on a regression line estimated on a sample that is independent of Y₀.
We defined the prediction error as:

η₀ = Y₀ − m̂(x₀) = m(x₀) − m̂(x₀) + ε₀

We proved that:

E[η₀] = 0

and

Var(η₀) = σ² ( 1 + 1/n + (x₀ − X̄)²/((n − 1) s_X²) )

In the Gaussian SLR model, both m̂(x₀) and ε₀ are Gaussian, so it follows from the same arguments as before that:

η₀ ~ N(0, Var(η₀))

A 100×(1 − α)% prediction interval for Y₀ is:

m̂(x₀) ± z_{α/2} √Var(η₀)

where z_{α/2} is the approximate critical value. In practice we replace Var(η₀) with an estimate based on plugging the SER into the formula.
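A sketch of the prediction interval on simulated data, plugging the SER-based estimate of σ² into Var(η₀) (the 95% level and all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 150
x = rng.uniform(0.0, 10.0, n)
y = 3.0 + 0.8 * x + rng.normal(0.0, 1.0, n)   # made-up true model

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
ser2 = np.sum(e ** 2) / (n - 2)               # plug-in estimate of sigma^2

x0 = 7.0                                       # new regressor value (illustrative)
var_eta = ser2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1)))
z = stats.norm.ppf(0.975)                      # z_{alpha/2} for a 95% interval
m_hat = b0 + b1 * x0
pi = (m_hat - z * np.sqrt(var_eta), m_hat + z * np.sqrt(var_eta))
print(pi)
```

Note the interval is wider than a confidence interval for m(x₀) because of the irreducible error term in Var(η₀).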
Regression analysis has multiple objectives:
1. Understanding the relationship between X and Y.
2. Prediction: estimating a regression function E(Y|X = x).
  a. “If we observe X = x, then we would expect E(Y|X = x)”
3. Causal analysis: estimating E(Y|do(X = x)).
  a. This is an explicit intervention: “if we do X = x, then we would expect E(Y|do(X = x))”
Example
Interpretation
It is a mistake to make the second type of statement, when general regression analysis only
supports the first.
Interpretation of the slope coefficient

Yᵢ = β₀ + β₁Xᵢ + εᵢ

• Correct: if we select two cases from the population where X differs by 1, then we would expect Y to differ by β₁.
• Acceptable: a unit change in X is associated with a β₁ change in Y on average in the population.
• Incorrect (unless causal inference is supported): if we change X by 1, then Y will change by β₁ on average.
Causal Analysis
• Run a randomized experiment. This is known as A/B testing in business.
Unfortunately, this is often not possible.
o Or find a natural experiment.
• Common sense principle: your study design must reflect what you want to measure. If you want to estimate E(Y|do(X = x)), then your data should have cases of “do X = something” (with attention to the requirements for causal inference).
Examples
• Randomized experiment: randomly divide a cohort into two groups. Make half take
Maths in Business, the other half does not take it. Compare the results on
QBUS2810.
o You can see why randomized experiments are often not feasible.
• Natural experiment: suppose that the Business School makes Maths in Business mandatory. We then compare the cohorts just before and after the external change, perhaps “controlling” for the observed characteristics of the two cohorts (such as their performance on first year units).
to MLR.

Yᵢ = β₀ + β₁Xᵢ + εᵢ
• The error, ε, represents factors that influence the response Y but are not included in the model. Hence there are always potential omitted variables.
• Omitted variable bias (OVB) in the OLS estimator occurs when the error term and X are correlated:

E(ε|X) ≠ 0

NB: this violates Assumption 2.
• Therefore two conditions must hold for a particular omitted factor, Z, to induce bias:
1) It needs to be related to Y (part of ε)
2) It needs to be correlated with existing regressors
Theoretical Analysis
What is the size and direction of the bias?

β̂₁ − β₁ = Σᵢ (Xᵢ − X̄)εᵢ / Σᵢ (Xᵢ − X̄)²

What if Cov(X, ε) = σ_Xε ≠ 0?
In general:

β̂₁ − β₁ = Σᵢ (Xᵢ − X̄)εᵢ / Σᵢ (Xᵢ − X̄)² →ᵖ Cov(X, ε)/Var(X) = ρ_Xε σ_ε/σ_X

Omitted variable bias:

β̂₁ →ᵖ β₁ + ρ_Xε (σ_ε/σ_X)
Interpretation:
• The OLS estimator is positively biased (it overestimates β₁) when the regressor is positively correlated with the error.
• The OLS estimator is negatively biased (it underestimates β₁) when the regressor is negatively correlated with the error.
• The bias does not get smaller with the sample size: β̂₁ is inconsistent when ρ_Xε ≠ 0.
• The bias is larger when the correlation between the error and the regressor is larger in absolute value.
• The bias is larger when the variance of the error is larger relative to the variance of X.
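A simulation sketch of omitted variable bias: the omitted factor Z below both affects Y and correlates with X, and the OLS slope converges to β₁ + Cov(X, ε)/Var(X) rather than β₁ (all coefficients are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
z = rng.normal(0.0, 1.0, n)                   # omitted factor
x = 0.8 * z + rng.normal(0.0, 1.0, n)         # X correlated with omitted Z
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(0.0, 1.0, n)   # true beta1 = 2

# SLR of Y on X alone: Z ends up in the error term
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Probability limit: beta1 + Cov(X, eps)/Var(X), with eps = 1.5*Z + noise,
# so Cov(X, eps) = 1.5 * 0.8 and Var(X) = 0.8^2 + 1
plim = 2.0 + (1.5 * 0.8) / (0.8 ** 2 + 1.0)
print(b1, plim)
```

With this positive correlation the slope is biased upward, and increasing n does not remove the bias.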
Example
• Marital status, age, home ownership, and purchase history stand out as being
positively related to both amount spent and salary.
• Hence, these regressors lead to omitted variable bias in the SLR.
• Due to the positive correlation, β̂₁ is upward biased in the SLR model.
Omitted variable bias and causality
The presence of omitted variables rules out causal interpretation of estimated linear
regression coefficients.
33
Recall, the three conditions needed to establish causality from X to Y:
1) X and Y must co-vary or have a relationship. When one changes the other must also
change.
2) The independent variable X must precede the dependent variable Y in time.
3) No other factor could possibly have caused the measured change in Y.
NB experimental studies eliminate OVB by assigning X independently of ε.
Omitted variable bias and prediction
• We are not necessarily interested in estimating the population ß1.
• In many applications, we simply want to predict Y given X. Omitted variable bias is a
problem in this case only to the extent it limits the accuracy of the predictive model.
Measurement error in the regressor

• Measurement error in the regressor is an issue for several areas of application of
regression analysis, such as finance and economics.
• This is a special case of OVB, where the omitted variable is the correctly measured
regressor.
o Example: CAPM. The CAPM says that the cross section of expected returns
should be a linear function of the asset betas.
o But we need to estimate beta first. This leads to a measurement error.

CAPM example
Yᵢ = β₀ + β₁xᵢ + εᵢ

Unfortunately, instead of xᵢ, we observe:

X̃ᵢ = xᵢ + uᵢ

where uᵢ is an i.i.d. measurement error which is independent of εᵢ and xᵢ and satisfies
E(uᵢ) = 0.
Hence the model we are trying to estimate is:

Yᵢ = β₀ + β₁X̃ᵢ + (εᵢ − β₁uᵢ)

or

Yᵢ = β₀ + β₁X̃ᵢ + δᵢ

where δᵢ = εᵢ − β₁uᵢ.
Because X̃ᵢ depends on uᵢ, which is part of δᵢ,

E(X̃ᵢδᵢ) ≠ 0 → E(δᵢ|X̃ᵢ) ≠ 0,

thus violating the SLR model assumption.
General case
Using similar arguments for the general OVB case,
β̂₁ →ᵖ Cov(X̃, Y)/Var(X̃) = β₁ · σ²_X/(σ²_X + σ²_u)

Interpretation:
• When there is measurement error in the regressor, the slope estimator is pulled
towards zero (the regression line is biased towards being flatter), since
σ²_X/(σ²_X + σ²_u) < 1. We call this an attenuation bias.
• The bias is more severe when the variance of the measurement error is larger
relative to the variance of the regressor.
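A short simulation (hypothetical parameter values) illustrates the attenuation factor σ²_X/(σ²_X + σ²_u):

```python
import numpy as np

# Hypothetical simulation: attenuation bias from measurement error in the regressor.
rng = np.random.default_rng(1)
n = 200_000
beta0, beta1 = 0.5, 3.0
sigma_x, sigma_u = 2.0, 1.0

x = rng.normal(scale=sigma_x, size=n)        # true regressor
u = rng.normal(scale=sigma_u, size=n)        # measurement error
x_obs = x + u                                # what we actually observe
y = beta0 + beta1 * x + rng.normal(size=n)   # y depends on the true x

b1_hat = np.cov(x_obs, y, ddof=1)[0, 1] / np.var(x_obs, ddof=1)
attenuation = sigma_x**2 / (sigma_x**2 + sigma_u**2)  # = 4/5 here

print(round(b1_hat, 2), round(beta1 * attenuation, 2))  # both close to 2.4
```

The slope is pulled from 3 towards 2.4, and the shrinkage matches σ²_X/(σ²_X + σ²_u) = 0.8.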
Multiple Linear Regression

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε
An MLR describes the relationship between a numerical, continuous response variable Y and
multiple predictors X1, … , Xp.
Statistical Objective: To estimate the conditional expectation E(Y | X₁ = x₁, …, Xₚ = xₚ) for
predictive purposes.
• In some cases we may be interested in a causal prediction. However, this requires a
study design specific for this purpose.
Terminology and notation
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₚXₚᵢ + εᵢ
• Y is the response or dependent variable.
• X₁, …, Xₚ are the predictors, independent variables, regressors, or features.
• β₀ is the intercept, a fixed and unknown population parameter.
• β₁, …, βₚ are the regression coefficients, fixed and unknown population parameters.
• εᵢ are the random errors.
• The subscript i indexes the observations, i = 1, …, n.
• Yᵢ, X₁ᵢ, X₂ᵢ, …, Xₚᵢ denote random variables.
• yᵢ, x₁ᵢ, x₂ᵢ, …, xₚᵢ denote observed values for the i-th unit.
Parameter interpretation
Consider a case with two regressors:

Y = β₀ + β₁X₁ + β₂X₂ + ε

• β₁ is the expected difference in Y when we select two cases from the population
where X₁ differs by one unit, and X₂ is the same.
• β₂ is the expected difference in Y when we select two cases from the population
where X₂ differs by one unit, and X₁ is the same.

X₁ differs by Δx₁, X₂ is the same:

E(Y | X₁ = x₁ + Δx₁, X₂ = x₂) − E(Y | X₁ = x₁, X₂ = x₂)
= [β₀ + β₁(x₁ + Δx₁) + β₂x₂] − [β₀ + β₁x₁ + β₂x₂] = β₁Δx₁
General rule:

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ⋯ + βₚXₚᵢ + εᵢ

• β₁ is the expected difference in Y when we select two cases from the population
where X₁ differs by one unit, and all the other regressors are the same.
If X1 and X2 are correlated, then X1 and X2 predict each other. Hence, the predictive
information from X1 and X2 combined for Y does not amount to adding up the information
from X1 and X2 separately, as this would entail some “double counting”.
Notation and terminology
Fitted values:

Ŷᵢ = β̂₀ + β̂₁X₁ᵢ + β̂₂X₂ᵢ + ⋯ + β̂ₚXₚᵢ

Residuals:

eᵢ = Yᵢ − Ŷᵢ
Measuring fit
Standard error of the regression
The SER estimates the standard deviation of the model errors ε:

SER = √( (1/(n − p − 1)) Σᵢ₌₁ⁿ eᵢ² )

(We divide by n − p − 1 because we estimate p + 1 parameters.)

Σᵢ(Yᵢ − Ȳ)² = Σᵢ(Ŷᵢ − Ȳ)² + Σᵢeᵢ²
TSS = RegSS + RSS

• TSS: total sum of squares
• RegSS: regression sum of squares
• RSS: residual sum of squares
R2
R² = RegSS/TSS = 1 − RSS/TSS = 1 − Σᵢeᵢ² / Σᵢ(Yᵢ − Ȳ)²
Interpretation:
• R2 measures the proportion of the variance in the response data that is accounted
for by the estimated linear regression model.
• R2 can only increase when you add another variable to the model.
• R2 is a useful part of the regression toolbox, but it does not measure the predictive
accuracy of the estimated regression, or more generally how good the model is.
Adjusted R2
The adjusted R2 penalizes the R2 for the number of regressors.
R̄² = 1 − [(n − 1)/(n − p − 1)] × RSS/TSS

• R̄² has the same interpretation as R².
• Unlike R², R̄² may increase or decrease upon the addition of another regressor.
• R̄² ≤ R². If n is large then R̄² ≈ R².
• The adjusted R2 can be negative.
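The fit measures above can be computed by hand. A sketch on simulated (hypothetical) data, using only numpy:

```python
import numpy as np

# Sketch (hypothetical data): compute SER, R^2 and adjusted R^2 for an MLR by hand.
rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])           # add the intercept column
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ beta_hat
e = y - fitted                                   # residuals

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum(e ** 2)
RegSS = TSS - RSS                                # TSS = RegSS + RSS

SER = np.sqrt(RSS / (n - p - 1))
R2 = 1 - RSS / TSS
R2_adj = 1 - (n - 1) / (n - p - 1) * RSS / TSS

print(round(SER, 2), round(R2, 3), round(R2_adj, 3))
```

Because the regression includes an intercept, RegSS computed as TSS − RSS coincides with Σ(Ŷᵢ − Ȳ)², and R̄² is slightly below R² as expected.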
Statistical inferences for MLR I

The MLR model

Assumptions

E(ε | X₁, …, Xₚ) = 0.
• Another example leading to multicollinearity is when there are more predictors than
observations: 𝑝 > 𝑛.
Terminology
• Perfect multicollinearity: one or more predictors are an exact linear combination of
the other predictors.
• Multicollinearity: one or more predictors are highly correlated with a linear
combination of other predictors.
• Collinearity: two predictors are highly correlated with each other.
Beyond the assumptions

The MLR model assumptions are not sufficient to guarantee reliable estimation and inference
in practice. Analysis should always consider:
• Outliers (unusual observations)
• Very skewed errors
• High correlation amongst predictors
Assumption checking

Residual diagnostics (see the plots in the module):
• Fitted values against residuals
• Predictors against residuals
• Fitted values against squared or absolute residuals
• Predictors against squared or absolute residuals
• Residual distribution
Variance
Var(β̂ⱼ) = σ² / [ (1 − R²ⱼ) Σᵢ₌₁ⁿ (xⱼᵢ − x̄ⱼ)² ] = σ² / [ (1 − R²ⱼ)(n − 1)s²ⱼ ]

R²ⱼ: the R-squared of a regression of predictor j on all other predictors.
xⱼᵢ: observed value of predictor j for observation i.
x̄ⱼ: sample average of predictor j.
s²ⱼ: sample variance of predictor j.
Interpretation:
• The formula is similar to the one for SLR but with the addition of the factor 1/(1 − R²ⱼ).
• The higher the correlation of predictor j with the other predictors, the higher the
variance of β̂ⱼ.
• The variance of β̂ⱼ is proportional to the variance of the errors σ².
• The variance decreases with the sample variance of predictor j.
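The role of the 1/(1 − R²ⱼ) term can be illustrated numerically. In this sketch (hypothetical data), R²ⱼ is obtained by regressing predictor j on the others:

```python
import numpy as np

# Sketch: how the variance-inflation term 1/(1 - R_j^2) grows as predictor j
# becomes more correlated with the other predictors (hypothetical data).
rng = np.random.default_rng(3)
n = 5_000

def r2_j(X, j):
    """R^2 from regressing column j of X on the remaining columns (plus intercept)."""
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    resid = X[:, j] - Z @ coef
    return 1 - resid.var() / X[:, j].var()

inflation = []
for rho in (0.0, 0.9):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    r2 = r2_j(np.column_stack([x1, x2]), 0)
    inflation.append(1 / (1 - r2))
    print(f"rho = {rho}: R_j^2 = {r2:.2f}, variance inflation = {1/(1-r2):.1f}")
```

With uncorrelated predictors the factor stays near 1; with a correlation of 0.9 it jumps to roughly 1/(1 − 0.81) ≈ 5.3.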
Do
Why does a higher correlation with other predictors increase the variance of 𝛽Â , all else
equal?
p
• 𝛽€ ∆𝑥€ is the expected difference in Y when we select two cases from the population
wa
as defined.
Th
Sampling distribution: Gaussian errors
In addition to Assumptions 1–6, suppose we assume that the errors are normally
distributed:

Y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₚxₚ + ε,  ε ~ N(0, σ²)

Then:

β̂ⱼ ~ N(βⱼ, Var(β̂ⱼ))
Sampling distribution: general case
Assumptions 1–6 only give moment conditions for the errors:

E(ε | X₁, …, Xₚ) = 0
Var(ε | X₁, …, Xₚ) = σ²

Therefore we do not know the distribution of β̂ⱼ in the general case, apart from its mean
and variance.
For n sufficiently large, we may choose to rely on the CLT approximation:

β̂ⱼ ≈ N(βⱼ, Var(β̂ⱼ))
Appendix
Conditional inferences
• The model treats the predictors X₁, …, Xₚ as random. This assumption reflects the
nature of observational data.
• However, when discussing statistical inference we treat the predictors as given.
o For example, we provide the variance of the least squares estimator as a
function of the observed predictor values.
o This is because we naturally work with the observed predictors.
• The sampling uncertainty of interest to us is therefore over the errors ε₁, ε₂, …, εₙ.
Statistical inferences for MLR II

Standardised coefficients

Standard errors

SE(β̂ⱼ) = SER / √( (1 − R²ⱼ) Σᵢ₌₁ⁿ (xⱼᵢ − x̄ⱼ)² )

where the standard error of the regression estimates σ:

SER = √( (1/(n − p − 1)) Σᵢ eᵢ² )
Robust standard errors
• Use robust standard errors when the assumption of constant variance is not satisfied
for the data.
o NB the technical details for robust standard errors are not discussed for MLR
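Although the notes do not cover the technical details for MLR, one common construction is the heteroscedasticity-consistent "sandwich" estimator (the HC1 variant is sketched here on hypothetical data; this is an illustration, not necessarily the exact formula used in the course):

```python
import numpy as np

# Sketch of heteroscedasticity-robust (HC1, "sandwich") standard errors.
# Hypothetical data with non-constant error variance.
rng = np.random.default_rng(4)
n = 2_000
x = rng.uniform(0, 2, size=n)
y = 1.0 + 2.0 * x + x * rng.normal(size=n)   # error sd grows with x

X = np.column_stack([np.ones(n), x])
p = X.shape[1] - 1
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Sandwich: (X'X)^-1 (sum_i e_i^2 x_i x_i') (X'X)^-1, with the HC1 df correction
meat = X.T @ (X * e[:, None] ** 2)
V_robust = n / (n - p - 1) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))

# Classical (constant-variance) SEs for comparison
sigma2_hat = e @ e / (n - p - 1)
se_classic = np.sqrt(np.diag(sigma2_hat * XtX_inv))

print(se_classic.round(4), se_robust.round(4))
```

When the constant-variance assumption fails, the two sets of standard errors diverge, and the robust ones are the trustworthy choice.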
Sampling distribution in Gaussian MLR model
If the errors are Gaussian, we can show that:
(β̂ⱼ − βⱼ) / SE(β̂ⱼ) ~ t_{n−p−1}
Inference for one coefficient
Sampling distribution: general case
If the distribution of errors is unspecified but the sample size is sufficiently large, we use the
CLT approximation.
(β̂ⱼ − βⱼ) / SE(β̂ⱼ) ≈ N(0, 1)

or

(β̂ⱼ − βⱼ) / SE(β̂ⱼ) ≈ t_{n−p−1}
Hypothesis testing

Relationship Test:

Two sided:
H₀: βⱼ = 0
H₁: βⱼ ≠ 0

Interpretation:
- If the null hypothesis is correct, there is no relationship between predictor j and the
response.

One sided:
H₀: βⱼ = 0
H₁: βⱼ > 0

General Test:

Two sided:
H₀: βⱼ = b
H₁: βⱼ ≠ b

One sided:
H₀: βⱼ = b
H₁: βⱼ > b
Test Statistic
t = (β̂ⱼ − bⱼ) / SE(β̂ⱼ) ~ t_{n−p−1}

• We can carry out hypothesis testing on βⱼ using the standard t-statistic with
n − p − 1 degrees of freedom.
• The test is a large sample approximation when the errors are not Gaussian.
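A sketch of the full testing recipe on simulated (hypothetical) data, using the standard-normal CLT approximation for the two-sided p-value:

```python
import math
import numpy as np

# Sketch: testing H0: beta_j = 0 against H1: beta_j != 0 with the CLT
# (standard-normal) approximation, on hypothetical simulated data.
rng = np.random.default_rng(5)
n = 1_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 0.3 * x1 + 0.0 * x2 + rng.normal(size=n)   # x2 truly has no effect

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1] - 1
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
SER = math.sqrt(e @ e / (n - p - 1))
se = SER * np.sqrt(np.diag(XtX_inv))

for j, name in [(1, "x1"), (2, "x2")]:
    t = beta_hat[j] / se[j]
    p_value = math.erfc(abs(t) / math.sqrt(2))   # two-sided N(0,1) p-value
    print(f"{name}: t = {t:.2f}, p = {p_value:.4f}")
```

The x1 coefficient produces a large t-statistic and a tiny p-value, while the x2 coefficient (true value zero) does not.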
Summary of Hypothesis testing

Example of Hypothesis Testing:
Test: H₀: β₁ = 0 vs H₁: β₁ ≠ 0
Significance level: α = 0.05
Assumptions: MLR model assumptions with non-constant error variance; n = 1000 is
sufficiently large for a CLT approximation (no outliers).
Estimator: β̂₁ (OLS). Robust standard error.
Test statistic: t = (β̂₁ − 0)/SE(β̂₁) ~ t_{n−p−1}
Calculated statistic: t = 23.5 (from the output)
Decision: the p-value < 0.0001 (from the output) is lower than α = 0.05. Alternatively,
23.5 > t_{997, 0.025} = 1.96.
Conclusion: We reject the null hypothesis.
Why the p-value does not measure coefficient importance

t = β̂ⱼ / [ SER / √( (1 − R²ⱼ)(n − 1)s²ⱼ ) ]

• A lower correlation with the other predictors increases the test statistic.
Inference for multiple coefficients
Consider the linear model:
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε

Test:
H₀: β₁ = β₂ = ⋯ = βₚ = 0
H₁: at least one βⱼ ≠ 0

F Statistic

F = (RegSS/p) / (RSS/(n − p − 1))
Interpretation:
• TSS = RegSS + RSS decomposes the variation in the data between the variation
accounted for by the estimated regression (RegSS) and the unaccounted variation
(RSS).
• If the null hypothesis is correct, we would expect a relatively low RegSS and a
relatively high RSS, leading to a small F statistic.
Understanding the F test: Chi-squared distribution

Z ~ N(0, 1) ⟹ V = Z² ~ χ²₁

If Z₁, …, Z_q are i.i.d. N(0, 1), then V = Σᵢ₌₁^q Zᵢ² ~ χ²_q.
Understanding the F test: general case
Under the null hypothesis, for a Gaussian MLR model, RegSS/σ² and RSS/σ² follow χ²
distributions with p and (n − p − 1) degrees of freedom respectively.
From the definition of the F distribution:

F = (RegSS/p) / (RSS/(n − p − 1)) ~ (χ²ₚ/p) / (χ²_{n−p−1}/(n − p − 1)) ≡ F_{p, n−p−1}

In the general case, we use the F distribution as a large sample approximation.
Key concepts of F test
• The F test is a one-sided test. We reject the null hypothesis for high values of the F
statistic.
• We denote the critical value as F_{p, n−p−1, α} and define it as the value such that
P(F_{p, n−p−1} > F_{p, n−p−1, α}) = α
• The p-value is P(F_{p, n−p−1} > f_stat), where f_stat is the test statistic calculated for
the sample.
• A heteroscedasticity robust version of the F test is available, but the formulation
departs substantially from the one that we have for the constant variance case.
ANOVA table
Sometimes the F test is reported as an ANOVA table.
F statistic: R2 formulation
Recall that

R² = RegSS/TSS = 1 − RSS/TSS

We can then rewrite the F statistic as:

F = (RegSS/p) / (RSS/(n − p − 1))
  = [(RegSS/TSS) / (RSS/TSS)] × (n − p − 1)/p
  = [R²/(1 − R²)] × (n − p − 1)/p
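The algebraic identity can be confirmed numerically (hypothetical data):

```python
import numpy as np

# Sketch: verify that F = (RegSS/p)/(RSS/(n-p-1)) equals
#         R^2/(1-R^2) * (n-p-1)/p  (hypothetical data).
rng = np.random.default_rng(6)
n, p = 400, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
e = y - Xd @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum(e ** 2)
RegSS = TSS - RSS
R2 = RegSS / TSS

F_from_ss = (RegSS / p) / (RSS / (n - p - 1))
F_from_r2 = R2 / (1 - R2) * (n - p - 1) / p

print(round(F_from_ss, 4), round(F_from_r2, 4))  # identical by construction
```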
Example

Test: H₀: β₁ = β₂ = 0 vs H₁: β₁ ≠ 0 and/or β₂ ≠ 0
Significance level: α = 0.05
Assumptions: MLR model assumptions with non-constant error variance; n = 1000 is
sufficiently large for a CLT approximation (no outliers).
Estimator: OLS
Test statistic: F = (RegSS/p) / (RSS/(n − p − 1)) ~ F_{p, n−p−1}
Decision: the p-value < 0.0001 (from the output) is lower than α = 0.05.
Conclusion: We reject the null hypothesis.
F test for q linear restrictions
Appendix
Data transformation
Introduction and Example
Data transformation consists of applying a deterministic mathematical function to each
observation of a response or predictor.
Data transformation is typically used for the following purposes:
• Modelling nonlinearity
• Meeting the assumption of constant error variance
• Reducing skewness

Example: Direct Marketing
In the MLR modules we introduced a predictive model for the amount spent by customers:

AS = β₀ + β₁×Salary + β₂×Catalogs + ε
• The relationship between amount spent and salary, holding catalogs constant was
represented in a scatter plot with a regression line.
• The model was then estimated:
• The model was then estimated (standard errors in parentheses):

AS = −53.68 + 0.0199×Salary + 51.695×Catalogs,   R² = 0.612, R̄² = 0.611
     (659.15)  (0.001)        (2.912)
• The next step was to check whether the model fits the data
• The assumptions are laid out
• Multiple graphs are plotted:
o Fitted values against residuals
o Fitted values against squared residuals
o Residual distribution
• The diagnostics (see images in module 7) reveal the following limitations:
o The residuals follow a nonlinear pattern. Hence, Assumptions 1 and 2 cannot
be correct.
o The residuals have non-constant variance. Assumption 3 is not correct.
o The residuals are positively skewed. Skewed errors do not violate any
assumption, but are less ideal.
• The regression can be improved by implementing log transformations.
Log transformations

• The interpretation of the slope coefficient is different in each case.

A 1% difference in X (a 0.01 difference in log(X)) is associated with a 0.01β₁ expected
difference in Y.
Log-linear regression function

log(y) = β₀ + β₁x

Consider a change in y:

log(y + Δy) = β₀ + β₁(x + Δx)

Subtracting the two, and using log(y + Δy) − log(y) ≈ Δy/y:

β₁ ≈ (Δy/y)/Δx   for small Δx
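A quick numerical check of this approximation (made-up coefficient values):

```python
import math

# Sketch: in a log-linear model log(y) = b0 + b1*x, a small change dx in x
# changes y by approximately 100*b1*dx percent.
b0, b1 = 1.0, 0.05
x, dx = 10.0, 1.0

y0 = math.exp(b0 + b1 * x)
y1 = math.exp(b0 + b1 * (x + dx))

exact_pct = 100 * (y1 - y0) / y0      # exact percentage change in y
approx_pct = 100 * b1 * dx            # approximation b1 ~ (dy/y)/dx
print(round(exact_pct, 2), round(approx_pct, 2))
```

The exact change, 100(e^0.05 − 1) ≈ 5.13%, sits close to the approximate 5%, and the gap shrinks as Δx gets smaller.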
Log transformation Summary
Estimating the conditional expectation
Consider the model:
log(Y) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε

where X₁, …, Xₚ are arbitrary predictors that may involve transformations.
• This specification means that:
E(log(Y) | X₁ = x₁, …, Xₚ = xₚ) = β₀ + β₁x₁ + ⋯ + βₚxₚ
• However, we are more often interested in knowing:
E(Y | X₁ = x₁, …, Xₚ = xₚ)

Considering the initial model and solving for Y:

Y = exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε)
Y = exp(β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ) exp(ε)

The conditional expectation is:

E(Y | X = x) = exp(β₀ + β₁x₁ + ⋯ + βₚxₚ) E(exp(ε) | X = x)

There are two cases to consider:
Gaussian MLR
If we assume that

ε ~ N(0, σ²)

then

E(exp(ε)) = exp(σ²/2)

Therefore in this case:

E(Y | X = x) = exp(β₀ + β₁x₁ + ⋯ + βₚxₚ + σ²/2)

NB: if the errors are not Gaussian, we can use this result as an approximation.
When we estimate the parameters by OLS, we replace the unknown error variance by its
usual estimator:

m̂(x) = exp(β̂₀ + β̂₁x₁ + ⋯ + β̂ₚxₚ + SER²/2)
General case
Duan (1983) proposed the following estimator that we can use when the errors are non-
Gaussian:

m̂(x) = exp(β̂₀ + β̂₁x₁ + ⋯ + β̂ₚxₚ) × (1/n) Σᵢ₌₁ⁿ exp(eᵢ)
Which estimator to use?
• The bias corrected estimators do not necessarily improve accuracy, as they have a
cost in terms of variance.
• The naïve back-transformation may perform better in some cases.
• Compute fitted values or predictions to investigate which approach performs better.
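A sketch of such a comparison on simulated (hypothetical) log-normal data, computing in-sample RMSE for the naive back-transformation, the Gaussian correction exp(SER²/2), and Duan's smearing estimator:

```python
import numpy as np

# Sketch (hypothetical log-normal data): compare three ways to back-transform
# predictions from a log-linear fit to the original scale.
rng = np.random.default_rng(7)
n = 3_000
x = rng.uniform(0, 3, size=n)
log_y = 0.2 + 0.5 * x + rng.normal(scale=0.8, size=n)
y = np.exp(log_y)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
e = np.log(y) - X @ beta_hat          # residuals on the log scale
SER2 = e @ e / (n - 2)                # SER^2 with p = 1 regressor

lin_pred = X @ beta_hat
pred_naive = np.exp(lin_pred)                       # naive back-transformation
pred_gauss = np.exp(lin_pred + SER2 / 2)            # Gaussian-error correction
pred_duan = np.exp(lin_pred) * np.mean(np.exp(e))   # Duan (1983) smearing

def rmse(pred):
    return np.sqrt(np.mean((y - pred) ** 2))

for name, pred in [("naive", pred_naive), ("gauss", pred_gauss), ("duan", pred_duan)]:
    print(f"{name}: in-sample RMSE = {rmse(pred):.3f}")
```

With genuinely log-normal errors, both corrections scale the naive prediction up by roughly exp(σ²/2) and improve the fit; on real data the ranking can differ, which is why the comparison is worth running.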
Example: Direct Marketing
Comparison of the regression models in terms of in-sample fit in the original scale.
• The log-log specification is the best (in-sample) fit for this data. The bias corrected
estimate gives a slight improvement.

RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² )
Power transformations
In the above example, using the log function made the response negatively skewed.

• Power transformations are a more general type of transformation that allows you to
obtain approximately symmetrical, or even normal, distributions.
• Some common power transformations are:
o Square root: √Y
o Inverse: 1/Y
o Square: Y²
o Box-Cox (defined below)
Box-Cox transformation
The Box-Cox transformation is a one-parameter family of transformations defined as:

y^(λ) = (y^λ − 1)/λ   if λ ≠ 0
y^(λ) = log(y)        if λ = 0
• Lower values of 𝜆 reduce positive skewness.
• Higher values of 𝜆 reduce negative skewness.
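A minimal implementation of the transformation (the λ = 0 branch is the limit of the general formula as λ → 0):

```python
import numpy as np

# Sketch: the Box-Cox transformation, with log(y) as the lambda = 0 limit.
def box_cox(y, lam):
    y = np.asarray(y, dtype=float)   # requires y > 0
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

y = np.array([0.5, 1.0, 2.0, 8.0])
print(box_cox(y, 0.5))    # square-root-type transform
print(box_cox(y, 0.0))    # log transform
print(box_cox(y, 1e-8))   # approaches the log as lambda -> 0
```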
Box-Cox regression
Y^(λ) = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₚXₚ + ε

• The disadvantage of the Box-Cox regression is that the coefficients do not have
convenient interpretations as in the log regression.
• This highlights another possible trade-off in statistical modelling: accuracy vs
interpretability.