
Introducing Multiple Regression

Simple Regression

Cause → Effect
Independent variable → Dependent variable

Oil Prices → Government Bond Yields

One cause, one effect

Multiple Regression

Causes → Effect
Independent variables → Dependent variable

Oil Prices, S&P 500 Share Index → Government Bond Yields

Many causes, one effect

Simple Regression
[Scatter plot of points (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) in the X-Y plane, with the fitted line]

Regression Line:
y = A + Bx

Represent all n points as (xi, yi), where i = 1 to n
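As a small sketch of fitting the line y = A + Bx by least squares, here is a minimal Python example; the data is made up, and np.polyfit is just one of several ways to do this:

```python
import numpy as np

# Made-up data: the true line is y = 2 + 3x, plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# np.polyfit with degree 1 fits y = Bx + A by least squares
B, A = np.polyfit(x, y, 1)
# A is close to 2, B is close to 3
```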
Multiple Regression
[Scatter plot of points (x1, y1, z1), (x3, y3, z3), …, (xn, yn, zn) in X-Y-Z space, with the fitted plane]

Regression Plane:
y = A + Bx + Cz

Represent all n points as (xi, yi, zi), where i = 1 to n
Multiple Regression

Causes: Dow Jones index, price of oil
Effect: Exxon stock
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt

    [E1]       [1]       [D1]       [O1]     [e1]
    [E2]       [1]       [D2]       [O2]     [e2]
    [E3]  = A  [1]  + B  [D3]  + C  [O3]  +  [e3]
    [… ]       [… ]      [… ]       [… ]     [… ]
    [En]       [1]       [Dn]       [On]     [en]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i
Multiple Regression
Regression Equation:
y = A + Bx + Cz

    [y1]       [1]       [x1]       [z1]     [e1]
    [y2]       [1]       [x2]       [z2]     [e2]
    [y3]  = A  [1]  + B  [x3]  + C  [z3]  +  [e3]
    [… ]       [… ]      [… ]       [… ]     [… ]
    [yn]       [1]       [xn]       [zn]     [en]
Multiple Regression
Regression Equation:
y = A + Bx + Cz

    [y1]     [1  x1  z1]            [e1]
    [y2]     [1  x2  z2]     [A]    [e2]
    [y3]  =  [1  x3  z3]  *  [B] +  [e3]
    [… ]     […  …   … ]     [C]    [… ]
    [yn]     [1  xn  zn]            [en]

    n Rows,     n Rows,      3 Rows,     n Rows,
    1 Column    3 Columns    1 Column    1 Column
Multiple Regression

2 Causes: Dow Jones index, price of oil
1 Effect: Exxon stock
Multiple Regression

k Causes: Dow Jones index, price of oil, bond yields…
1 Effect: Exxon stock
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

    [y1]        [1]        [x11]               [x1k]     [e1]
    [y2]        [1]        [x21]               [x2k]     [e2]
    [y3]  = C1  [1]  + C2  [x31]  + … + Ck+1   [x3k]  +  [e3]
    [… ]        [… ]       [… ]                [… ]      [… ]
    [yn]        [1]        [xn1]               [xnk]     [en]
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

    [y1]     [1  x11  …  x1k]     [C1  ]     [e1]
    [y2]     [1  x21  …  x2k]     [C2  ]     [e2]
    [y3]  =  [1  x31  …  x3k]  *  [ …  ]  +  [e3]
    [… ]     […   …   …   … ]     [Ck+1]     [… ]
    [yn]     [1  xn1  …  xnk]                [en]

    n Rows,     n Rows,         k+1 Rows,    n Rows,
    1 Column    k+1 Columns     1 Column     1 Column
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

Multiple regression involves finding

k+1 coefficients, k for the explanatory
variables, and 1 for the intercept
Estimation Methods in Multiple Regression

Method of moments · Maximum likelihood estimation · Method of least squares

The method of least squares works for

multiple regression too
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

The “best fit” line is the one where

the sum of the squares of the
lengths of the errors is minimised
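As a sketch of what minimising the sum of squared errors looks like for two explanatory variables, here is a minimal Python example; the data and coefficient values are made up:

```python
import numpy as np

# Made-up data: true relationship y = 1 + 2x - 3z, plus noise
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x - 3.0 * z + rng.normal(scale=0.1, size=n)

# Design matrix: a column of 1s for the intercept, then one column per cause
X = np.column_stack([np.ones(n), x, z])

# lstsq minimises the sum of squared errors ||y - X @ coef||^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
A, B, C = coef
# A, B, C land close to 1, 2, -3
```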
Risks in Multiple Regression
Simple and Multiple Regression

Simple Regression: Data in 2 dimensions
Multiple Regression: Data in > 2 dimensions
Simple and Multiple Regression

Simple Regression: Risks exist, but can usually be mitigated by analysing R2 and residuals
Multiple Regression: Risks are more complicated, and require interpreting regression statistics
Risks in Simple Regression

No cause-effect relationship: Regression on completely unrelated data series
Mis-specified relationship: Non-linear (exponential or polynomial) fit
Incomplete relationship: Multiple causes exist, we have captured just one
Diagnosing Risks in Simple Regression

No cause-effect relationship: low R2, plot of X ~ Y has no pattern
Mis-specified relationship: low R2, residuals are not independent of x
Incomplete relationship: high R2, residuals are not independent of each other
Mitigating Risks in Simple Regression

No cause-effect relationship: Wrong choice of X and Y - back to the drawing board
Mis-specified relationship: Transform X and Y - convert to logs or returns
Incomplete relationship: Add X variables (move to multiple regression)
The big new risk with multiple
regression is multicollinearity: X
variables containing the same
information
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

    [y1]     [1  x11  …  x1k-1]     [C1]
    [y2]     [1  x21  …  x2k-1]     [C2]
    [y3]  =  [1  x31  …  x3k-1]  *  [… ]
    [… ]     […   …   …    …  ]     [Ck]
    [yn]     [1  xn1  …  xnk-1]

    n x 1        n x k            k x 1
    (n Rows,     (n Rows,         (k Rows,
    1 Column)    k Columns)       1 Column)
Bad News: Multicollinearity Detected

[Scatter plot of X1 against Xk: High R2]

Highly correlated explanatory variables

Good News: No Multicollinearity Detected

[Scatter plot of X1 against Xk: Low R2]

Uncorrelated explanatory variables

Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt

    [E1]       [1]       [D1]       [O1]
    [E2]       [1]       [D2]       [O2]
    [E3]  = A  [1]  + B  [D3]  + C  [O3]
    [… ]       [… ]      [… ]       [… ]
    [En]       [1]       [Dn]       [On]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i
Good News: No Multicollinearity Detected

[Scatter plot of DOW Returns against OIL: Low R2]

Uncorrelated explanatory variables

Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C NASDAQt

    [E1]       [1]       [D1]       [N1]
    [E2]       [1]       [D2]       [N2]
    [E3]  = A  [1]  + B  [D3]  + C  [N3]
    [… ]       [… ]      [… ]       [… ]
    [En]       [1]       [Dn]       [Nn]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Ni = % return of NASDAQ index on day i
Bad News: Multicollinearity Detected

[Scatter plot of DOW against NASDAQ returns: High R2]

Highly correlated explanatory variables

Multicollinearity Kills Regression’s Usefulness

Explaining Variance: The R2 as well as the regression coefficients are not very reliable
Making Predictions: The regression model will perform poorly with out-of-sample data
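A small numpy sketch (made-up data) of why multicollinearity makes coefficients unreliable: when a second x variable nearly duplicates the first, the standard error of the first coefficient explodes, even though the overall fit looks fine:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2_good = rng.normal(size=n)                   # uncorrelated with x1
x2_bad = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def se_of_x1_coef(x2):
    # Fit y = A + B*x1 + C*x2 and return the standard error of B
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 3)           # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance matrix of coefficients
    return np.sqrt(cov[1, 1])

se_good = se_of_x1_coef(x2_good)
se_bad = se_of_x1_coef(x2_bad)
# se_bad is far larger: x1's coefficient is no longer reliably estimated
```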
Multicollinearity: Prevention and Cure

Common Sense: Big-picture understanding of the data
Nuts and Bolts: Setting up data right
Heavy Lifting: Factor analysis, Principal components analysis (PCA)

Common Sense:
Think deeply about each x variable
Eliminate closely related ones
Dig down to underlying causes
Multiple Regression

Proposed Regression Equation:

EXXONt = A + B DOWt + C NASDAQt + D OILt
% return on Exxon stock on day i
% return of Dow Jones index on day i
% return of NASDAQ index on day i
% change in price of oil on day i
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

Dow Jones Industrial Average: 30 Large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: Price of barrel of oil
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

Dow Jones Industrial Average: 30 Large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: Price of barrel of oil

Do we really need both Dow and NASDAQ returns as

explanatory variables?
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

Dow Jones Industrial Average: 30 Large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: Price of barrel of oil

If yes - consider keeping one, and constructing a

new explanatory variable of their difference
Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C OILt

Dow Jones Industrial Average: 30 Large-cap US stocks
Oil Prices: Price of barrel of oil

What underlying factors drive both US large-cap

stocks and the price of oil?
Multiple Regression

Original Regression Equation:

EXXONt = A + B DOWt + C NASDAQt + D OILt

Revised Regression Equation:

EXXONt = A + B DOWt + C INTERESTt + D GDPt
Nuts and Bolts:
‘Standardise’ the variables
Rely on adjusted-R2, not plain R2
Set up dummy variables right
Distribute lags
Heavy Lifting:
Find underlying factors that drive the correlated x variables
Principal Component Analysis (PCA) is a great tool
Multiple Regression

Proposed Regression Equation:

HOMEt = A + B 5-yeart + C 10-yeart + D 2-yeart +
E 1-yeart + F 3-montht + G 1-dayt + …

HOME = % change in home prices in month i
5-year = yield on 5-year bond in month i
10-year = yield on 10-year bond in month i
…
Bad News: Multicollinearity Detected
[Plot of each explanatory variable Xi against Time]

Factor Analysis

3-month government bonds, 1-year government bonds, 5-year government bonds,
1-day (overnight) money market, 30-year government bonds, 5-year swap rate (inter-bank)

Interest rates on a wide variety of fixed-income instruments
Factor Analysis on Interest Rates

Level: How high are interest rates?
Slope: How steep is the yield curve?
Twist: How convex is the yield curve?

Three uncorrelated factors explain most variation in

all interest rates
Factor Analysis:
The factors identified are guaranteed to be uncorrelated
However, they may not have an intuitive interpretation
Principal Component Analysis is one procedure for factor analysis
Dimensionality Reduction via Factor Analysis

20-dimensional data  →  (Factor Analysis)  →  3-dimensional data

1 column for each interest rate out there  →  3 columns, 1 for each factor
Dimensionality Reduction via Factor Analysis

                         Factor Analysis
    [x11  …  x1k-1]          [f11  f12  f13]
    [x21  …  x2k-1]          [f21  f22  f23]
    [x31  …  x3k-1]    →     [f31  f32  f33]
    [ …   …    …  ]          [ …    …    … ]
    [xn1  …  xnk-1]          [fn1  fn2  fn3]

    1 column for each        3 columns, 1 for
    interest rate out there  each factor
Factor Analysis is a dimensionality-
reduction technique to identify a
few underlying causes in data
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-yeart + C 10-yeart + D 2-yeart +
E 1-yeart + F 3-montht + G 1-dayt + …

Principal Component Analysis

Revised Regression Equation:

HOMEt = A + B LEVELt + C SLOPEt + D TWISTt
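A minimal sketch of the idea with simulated (made-up) rates driven by just a level and a slope factor; PCA is computed here directly from the SVD of the centred data matrix rather than a statistics library:

```python
import numpy as np

# Simulated (made-up) rates: k maturities all driven by just 2 factors
rng = np.random.default_rng(3)
n, k = 1000, 8
level = rng.normal(size=n)                    # 'level' factor
slope = rng.normal(size=n)                    # 'slope' factor
maturities = np.linspace(0.25, 30, k)
rates = (level[:, None]
         + slope[:, None] * (maturities / 30)
         + 0.05 * rng.normal(size=(n, k)))    # small idiosyncratic noise

# PCA via SVD of the centred data matrix
centred = rates - rates.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)           # variance share per component
# The first 2 components capture nearly all variation in all k rates
```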
Benefits of Multiple Regression
Simple Regression Is a Great Tool

Powerful: Perfectly suited to two common use-cases
Versatile: Easily extended to non-linear relationships
Deep: The first “crossover hit” from Machine Learning
Multiple Regression Is Even Better

Powerful: Also controls for effects of different causes
Versatile: Also works with categorical data
Deep: Especially if combined with factor analysis
Two Common Applications of Regression

Explaining Variance: How much variation in one data series is caused by another?
Making Predictions: How much does a move in one series impact another?
Controlling For Different Causes

Proposed Regression Equation:

EXXONt = A + B DOWt + C OILt
% return on Exxon stock on day i
% return of Dow Jones index on day i
% change in price of oil on day i

All else being equal, how much will Exxon stock

move by if oil prices increase by 1%?
Controlling For Different Causes

  EXXONt = A + B DOWt + C OILt
- EXXON? = A + B DOWt + C (OILt + 1%)

Change in EXXON = C

All else being equal, how much will Exxon stock move by if oil prices increase by 1%?
Regression coefficients tell how
much y changes for a unit change
in each predictor, all others being
held constant
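A tiny sketch of this interpretation; the coefficient values below are made up, not estimates from real data:

```python
# Hypothetical fitted coefficients for EXXON = A + B*DOW + C*OIL (made-up values)
A, B, C = 0.01, 0.9, 0.2

def predict(dow, oil):
    return A + B * dow + C * oil

# All else held equal, a 1-unit move in OIL changes the prediction by exactly C
delta = predict(dow=0.5, oil=3.0) - predict(dow=0.5, oil=2.0)
```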
Multiple Regression Is Even Better

Powerful: Also controls for effects of different causes
Versatile: Also works with categorical data
Deep: Especially if combined with factor analysis
Interpreting the Results of a
Regression Analysis
Interpreting Results of a Simple Regression

R2: Measures overall quality of fit - the higher the better (up to a point)
Residuals: Check if regression assumptions are violated

Standard errors of individual coefficients are usually of little significance
Interpreting Results of a Multiple Regression

Adjusted R2 · Residuals · F-statistic · Standard Errors of coefficients · R2
e = y - y’
=> y = y’ + e
=> Variance(y) = Variance(y’ + e)
=> Variance(y) = Variance(y’) + Variance(e) + 2 Covariance(y’,e)

A Not-Very-Important Intermediate Step

Variance of the dependent variable can be decomposed into variance of the
regression fitted values, and that of the residuals
e = y - y’
=> y = y’ + e
=> Variance(y) = Variance(y’ + e)
=> Variance(y) = Variance(y’) + Variance(e) + 2 Covariance(y’,e)

Covariance(y’,e): Always = 0

A Leap of Faith
This is important - more on why in a bit
Variance(y) = Variance(y’) + Variance(e)

Variance Explained
Variance of the dependent variable can be decomposed into variance of the
regression fitted values, and that of the residuals
TSS = Variance(y)

Total Variance (TSS): A measure of how volatile the dependent variable is, and of how much it moves around

ESS = Variance(y’)

Explained Variance (ESS): A measure of how volatile the fitted values are - these come from the regression line

RSS = Variance(e)

Residual Variance (RSS): This is the variance in the dependent variable that can not be explained by the regression

TSS = ESS + RSS
R2 = ESS / TSS

R2: The percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (upper bound is 100%)

In multiple regression, adding explanatory variables always increases R2, even if those variables are irrelevant and increase the danger of multicollinearity
Adjusted-R2 = R2 x (Penalty for adding irrelevant variables)

Adjusted-R2
Increases if irrelevant* variables are deleted

(*irrelevant variables = any group whose F-ratio < 1)
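A small numpy sketch of both points on made-up data; the standard adjusted-R2 formula 1 - (1 - R2)(n - 1)/(n - p - 1) is assumed here, which is one common way of writing the penalty:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
irrelevant = rng.normal(size=n)        # a variable with no real effect on y

def r2_and_adjusted(columns):
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)  # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1 - rss / tss                 # equivalently ESS / TSS
    p = X.shape[1] - 1                 # number of explanatory variables
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

r2_one, adj_one = r2_and_adjusted([x])
r2_two, adj_two = r2_and_adjusted([x, irrelevant])
# r2_two is never below r2_one, even though 'irrelevant' adds nothing
```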

Extending Multiple Regression to
Categorical Variables
A Simple Regression

Proposed Regression Equation:

y = A + Bx
y: Height of individual
x: Average height of parents
A Simple Regression
[Scatter plot: male and female points, with one regression line y = A + Bx between the two groups]

Not a great fit - regression line is far from all points!

A Simple Regression
[Scatter plot: the line y = A1 + Bx, intercept A1, passing through the male points]

We can easily plot a great fit for males…

A Simple Regression
[Scatter plot: the line y = A2 + Bx, intercept A2, passing through the female points]

…and another great fit for females

A Simple Regression
[Plot: two parallel lines - males: y = A1 + Bx, females: y = A2 + Bx - with intercepts A1 and A2 on the y-axis]

Two lines - same slope, different intercepts

Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
D = 0 for males
  = 1 for females
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
For males: D = 0

y = A1 + (A2 - A1)(0) + Bx
  = A1 + Bx
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
For females: D = 1

y = A1 + (A2 - A1)(1) + Bx
  = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
y: Height of individual
x: Average height of parents

Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
D=0 for males
=1 for females
Adding A Dummy Variable

Combined Regression Line:

y = A1 + (A2 - A1)D + Bx
D = 0 for males
  = 1 for females

The data contained 2 groups, so we added 1 dummy variable
Given data with k groups, set up k-1
dummy variables, else
multicollinearity occurs
Dummy and Other Categorical Variables

Dummy Variables: Binary - 0 or 1
Categorical Variables: Finite set of values - e.g. days of week, months of year…

To include non-binary categorical

variables, simply add more dummies
Testing for Seasonality

Proposed Regression Equation:

y = A + BQ1 + CQ2 + DQ3
y: Average stock returns
Q1, Q2, Q3: Quarter-of-the-year dummies

The data contains 4 groups, so we

added 3 dummy variables
Testing for Seasonality

The data contains 4 groups, so we

added 3 dummy variables
Q1 = 1 for Jan, Feb, Mar
=0 for other quarters
Q2 = 1 for Apr, May, Jun
=0 for other quarters
Q3 = 1 for July, Aug, Sep
=0 for other quarters
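The quarter dummies above can be built and fitted directly; a minimal numpy sketch with made-up seasonal means (Q4 is the baseline group absorbed by the intercept):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
quarters = rng.integers(1, 5, size=n)          # quarter label 1..4 per observation
# Hypothetical average returns per quarter (made-up numbers)
means = {1: 0.03, 2: 0.01, 3: -0.02, 4: 0.02}
y = np.array([means[int(q)] for q in quarters]) + 0.001 * rng.normal(size=n)

# 4 groups -> 3 dummies; Q4 is the baseline, absorbed by the intercept A
Q1 = (quarters == 1).astype(float)
Q2 = (quarters == 2).astype(float)
Q3 = (quarters == 3).astype(float)
X = np.column_stack([np.ones(n), Q1, Q2, Q3])
A, B, C, D = np.linalg.lstsq(X, y, rcond=None)[0]
# A estimates the Q4 mean; B, C, D estimate each quarter's difference from Q4
```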
Different Groups, Different Slopes
[Plot: males on the line y = A1 + B1x, females on the line y = A2 + B2x]

Dummy variables can also be extended for use

where groups have different slopes
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

Combined Regression Line:

y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2

D1 = 0 for males      D2 = 0 for males
   = 1 for females       = x for females
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

For males: D1 = 0, D2 = 0
y = A1 + (A2 - A1)(0) + B1x + (B2 - B1)(0)
  = A1 + B1x
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

For females: D1 = 1, D2 = x
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x
  = A1 + (A2 - A1) + B1x + (B2 - B1)x
  = A2 + B2x
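The intercept-shift and slope-shift dummies can be estimated in one fit; a numpy sketch on made-up data where males follow y = 1 + 2x and females y = 3 + 5x:

```python
import numpy as np

# Made-up data: males follow y = 1 + 2x, females follow y = 3 + 5x
rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
female = rng.integers(0, 2, size=n).astype(float)
y = np.where(female == 1, 3 + 5 * x, 1 + 2 * x) + 0.01 * rng.normal(size=n)

D1 = female       # intercept-shift dummy: 0 for males, 1 for females
D2 = female * x   # slope-shift dummy: 0 for males, x for females
X = np.column_stack([np.ones(n), D1, x, D2])
a1, a_shift, b1, b_shift = np.linalg.lstsq(X, y, rcond=None)[0]
# a1 ~ A1, a1 + a_shift ~ A2, b1 ~ B1, b1 + b_shift ~ B2
```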
Dummy Variables

Dummy on X: Linear regression
Dummy on Y: Logistic regression
Normal Distribution
[Bell curve centred at μ]

N(μ,σ)

Average (mean) is μ
Standard deviation is σ
Standard Errors

E(α) = A                           E(β) = B
Sampling Distribution of α         Sampling Distribution of β

α is the population parameter, A is the sample parameter
β is the population parameter, B is the sample parameter

Standard error of a regression parameter is the standard deviation of the sampling distribution
Strong Cause-effect Relationship
[Scatter plot: points tight around the line - High R2]

Residuals are small, standard errors are small

Weak Cause-effect Relationship

[Scatter plot: points spread far from the line - Low R2]

Residuals are large, standard errors are large

Standard Errors and Residuals

Low Standard Error: High confidence that parameter coefficient is well estimated
High Standard Error: Low confidence that parameter coefficient is well estimated

The smaller the residuals, the smaller the standard

errors and the better the quality of the regression
Sample Regression Line
Regression Equation:
y = A + Bx

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
… …
yn = A + Bxn + en
Sample Regression Line
Regression Equation:
y = A + Bx
Residuals: the terms e1 … en

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
… …
yn = A + Bxn + en
RSS = Variance(e)

Residual Variance (RSS): Easily calculated from regression residuals

SE(α), SE(β) can be found from RSS

Estimate Standard Errors from RSS: Exact formulae are not important - reported by Excel, R…
The smaller the residuals, the smaller
the standard errors and the better
the quality of the regression
Standard Errors

E(α) = A                     E(β) = B
Standard Error of α          Standard Error of β

α is the population parameter, A is the sample parameter
β is the population parameter, B is the sample parameter

Standard error of a regression parameter is the standard deviation of the sampling distribution
Null Hypotheses

What if the population parameter α were actually zero?

Call this the null hypothesis H0
Null Hypothesis: α = 0

[Sampling distribution centred at E(α) = 0; the estimate α = A lies t x SE(α) away from the mean]

If this were actually true, how likely is it that our sample regression would yield the estimate α = A?
Why Zero?

[Two plots: Sample Regression Line y = A + Bx, and Population Regression Line y = α + βx]

If α = 0, it is adding no value in the regression line and should just be excluded
Null Hypothesis: α = 0

[Sampling distribution centred at E(α) = 0; the estimate α = A lies t x SE(α) away from the mean]

The farther from the mean, the more unlikely that α = 0
t-Statistics

[Sampling distributions centred at E(α) = 0 and E(β) = 0; A lies 0.85 x SE(α) from the mean, B lies 9.01 x SE(β) from the mean]

t-stat(α) = A/SE(α) = 0.85
t-stat(β) = B/SE(β) = 9.01

We are now testing a hypothesis, that the population parameter is actually zero
t-Statistics

t-stat(α) = A/SE(α) = 0.85
t-stat(β) = B/SE(β) = 9.01

Is an individual estimate of α or β ‘adding value’ at all?

High t-statistic => Yes

The higher the t-statistic of a
coefficient, the higher our confidence
in our estimate of that coefficient
p-Values

[Sampling distributions centred at E(α) = 0 and E(β) = 0; A lies 0.85 x SE(α) away, B lies 9.01 x SE(β) away; the tail areas give the p-values]

p-value(α) = 0.39
p-value(β) = 2 x 10-15 ~ 0

Low t-stat, high p-value         High t-stat, low p-value

Is an individual estimate of α or β ‘adding value’ at all?

low p-value => Yes
The lower the p-value of a coefficient,
the higher our confidence in our
estimate of that coefficient
SER = RSS / (n - 2)

Standard Error of Regression (SER)

n is the number of points in the regression.
SER provides an unbiased estimator of error variance σ2

RSS / σ2 ~ χ2

χ2 Distribution
Never mind the fine print about degrees of freedom for now
Null Hypotheses

What if all population parameters were zero? i.e. β = α = 0

Call this the null hypothesis H0
Null Hypothesis: β = α = 0

[Distribution of the F-statistic under β = 0, α = 0; our estimates β = B, α = A lie far out]

If this were actually true, how likely is it that our sample regression would yield the estimates β = B, α = A?
Why Zero?

[Two plots: Sample Regression Line y = A + Bx, and Population Regression Line y = α + βx]

If α = β = 0, our regression line is not adding any value at all
Null Hypothesis: β = α = 0

[Distribution of the F-statistic under β = 0, α = 0; our estimates β = B, α = A lie far out]

Is the entire regression ‘adding value’ at all?

High F-statistic => Yes

p-values and t-statistics tell us
whether individual parameter
coefficients are ‘good’
The F-statistic tells us whether an entire regression line is ‘good’
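The t- and F-statistics above can be computed by hand from the residuals; a numpy sketch on made-up data with a strong slope and a near-zero intercept (in practice Excel, R or statsmodels report these for you):

```python
import numpy as np

# Made-up data: strong slope (3.0), intercept near zero (0.05)
rng = np.random.default_rng(7)
n = 100
x = rng.normal(size=n)
y = 0.05 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rss = resid @ resid
sigma2 = rss / (n - 2)                          # unbiased error-variance estimate
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

t_alpha = beta[0] / se[0]   # small: intercept may not be 'adding value'
t_beta = beta[1] / se[1]    # large: slope clearly is
ess = np.sum((X @ beta - y.mean()) ** 2)
F = (ess / 1) / (rss / (n - 2))                 # F-statistic for the whole line
```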
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1D1 + A2D2 + Bx

D1 = 1 for males      D2 = 1 for females
   = 0 for females       = 0 for males
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1D1 + A2D2 + Bx
For males: D1 = 1, D2 = 0

y = A1(1) + A2(0) + Bx
  = A1 + Bx
Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

Combined Regression Line:

y = A1D1 + A2D2 + Bx
For females: D1 = 0, D2 = 1

y = A1(0) + A2(1) + Bx
  = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
y: Height of individual
x: Average height of parents

Combined Regression Line:

y = A1D1 + A2D2 + Bx

D1 = 1 for males      D2 = 1 for females
   = 0 for females       = 0 for males
Given data with k groups, set up k-1
dummy variables and an intercept, or
k dummy variables with no intercept
Regression Without Intercept

Regression R2 can go negative
Excel, Python and R all adjust the R2 formula in this case

Excel and R usually agree
Python statsmodels R2 sometimes differs
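Why the convention matters once the intercept is dropped can be seen in a numpy-only sketch (made-up data, no statsmodels): the centred R2 formula can go negative, while an uncentred convention stays in [0, 1]:

```python
import numpy as np

# Made-up data with a large intercept, then a fit forced through the origin
rng = np.random.default_rng(8)
n = 50
x = rng.normal(size=n)
y = 10.0 + 0.1 * x + 0.1 * rng.normal(size=n)

B = (x @ y) / (x @ x)          # least squares slope for y = Bx (no intercept)
resid = y - B * x
rss = resid @ resid

tss_centred = np.sum((y - y.mean()) ** 2)
r2_centred = 1 - rss / tss_centred        # can go negative without an intercept
tss_uncentred = y @ y
r2_uncentred = 1 - rss / tss_uncentred    # an uncentred convention stays in [0, 1]
```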