
Chapter 6

Linear Regression Models
INTRODUCTION
Regression analysis is a statistical technique used to model a relationship
between a dependent variable (known as the output variable, response
variable, or simply the y variable) and a set of one or more independent
variables (known as the explanatory factors, predictor variables, or simply
the x variables). The dependent y variable is also referred to as the LHS
(left-hand side) variable, and the x variables are referred to as the RHS
(right-hand side) variables of the equation. The goal of performing regression
analysis is to uncover a set of statistically significant x variables and the
sensitivity of the y variable to these input variables. We are interested in
learning how the y variable will change given a change in the x variables.
These sensitivities are known as the model betas and are denoted as b.
After a statistically significant relationship is uncovered,
analysts can forecast future outcome events.
Regression analysis is used in finance for many different purposes. For
example, regression analysis is used for asset pricing models such as the
capital asset pricing model and arbitrage pricing theory, for price prediction
and scenario analysis, risk modeling, volatility forecasting, and Monte Carlo
simulation.
More recently, regression models have made their way into the algorithmic
trading arena where they are used for transaction cost analysis, market
impact estimation, and portfolio optimization.
The usage of regression analysis in trading and finance serves four main
purposes:
1. Determining a statistically significant relationship between the y variable
and x variable(s).
2. Estimating model parameters, b.
3. Forecasting future y values.
4. Performing what-if and scenario analysis to understand how the y values
will change given different sets of x input variables.


In this chapter, we present four different regression analysis techniques:


- Linear regression
- Log-linear regression
- Polynomial regression
- Fractional regression
These models are illustrated in Fig. 6.1. Fig. 6.1A is a linear relationship
with the form y = b0 + b1x1. Notice that this equation follows the familiar
shape of a straight line.
FIGURE 6.1 Regression Models. (A) Linear Model; (B) Log-Linear Model.



FIGURE 6.1 cont'd. (C) Polynomial Model; (D) Fractional Model.

Fig. 6.1B shows a log relationship between dependent variable y and
independent variable x. The form of this equation is y = b0 + b1·ln(x1).
Fig. 6.1C shows a polynomial relationship between dependent variable y and
a single x variable. The equation is y = b0 + b1x + b2x² + b3x³. Fig. 6.1D
shows a fractional polynomial relationship between y and x. This equation is
y = b0 + b1x + b2x^0.5 + b3x^1.5.
The difference between the polynomial regression model and fractional
regression model is that the polynomial model can include any value of
x, including both positive and negative values, but it can only have positive
integer exponents. The fractional polynomial model can have any exponent
value, including positive and negative values, and both integers and
fractions, but the fractional model is only defined for a positive value of x.

Linear Regression Requirements


A proper regression model and analysis needs to satisfy seven main
assumptions. These are explained in detail in Gujarati (1988),
Kennedy (1998), and Greene (2000).
The main assumptions of the linear regression model are:
A1. Linear relationship between the dependent variable and the model parameters:

y = b0 + b1 x1 + … + bk xk + e

A2. Unbiased parameter values: the estimated parameter values are unbiased
estimates of the true parameter values and satisfy the following:

E(b̂0) = b0, E(b̂1) = b1, …, E(b̂k) = bk

A3. Error term mean of zero: the expected value of the error term is zero:

E(e) = 0

A4. Constant variance: each error term has the same variance, i.e., no
heteroskedasticity:

Var(ek) = σ² for all k

A5. Independent error terms: no autocorrelation of any degree:

E(ek ek−t) = 0 for all lagged time periods t

A6. Errors are independent of the explanatory factors:

Cov(e, xk) = 0 for all factors k

A7. Explanatory factors are independent:

Cov(xj, xk) = 0 for all factors j ≠ k

Regression Metrics
In performing regression analysis and evaluating the model, we need the
following set of statistical metrics and calculations:
bk = model parameter value: the estimated sensitivity of y to factor k
e = regression error, determined from the estimation process
Se(bk) = standard error of the estimated parameter bk
Syx = standard error of the regression model using the set of explanatory factors
R² = goodness of fit (the percentage of overall variance explained by the model)
T-stat = critical value for the estimated parameter
F-stat = critical value for the entire model
To assist in our calculations, and to make the math easier to follow, we
introduce the following terms:
Total sum of squares: the sum of the squared difference between the actual
y value and the average y value:
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2

Regression sum of squares: the sum of the squared difference between the
predicted ŷ value and the average y value:

SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

Error sum of squares: the sum of the squared difference between the
predicted ŷ value and the actual y value:

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Mean square regression: the sum of the squared difference between the
predicted ŷ value and the average y value divided by the number of factors k:

MSR = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{k} = \frac{SSR}{k}
Mean sum of square errors: the sum of the squared errors divided by the
degrees of freedom:

MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1} = \frac{SSE}{n - k - 1}
Sum of squared X: the sum of the squared difference between the actual xk
value and its average value. For a simple linear regression model there is
only one x variable:
SSX_k = \sum_{i=1}^{n} (x_{ki} - \bar{x}_k)^2

Sum of squared Y: the sum of the squared difference between the actual y
value and the average y value:
SSY = \sum_{i=1}^{n} (y_i - \bar{y})^2

Sum of squared XY: the sum of the cross products of the x deviations and the
y deviations from their means:

SSX_k Y = \sum_{i=1}^{n} (x_{ki} - \bar{x}_k)(y_i - \bar{y})
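To make these building blocks concrete, the following is a minimal Python sketch of the calculations; the data arrays are hypothetical, and y_hat stands for the fitted values of whatever model is being evaluated:

```python
import numpy as np

# Hypothetical actual values, fitted values, and a single x factor
y     = np.array([0.021, -0.043, 0.087, 0.016, -0.008, 0.060])
y_hat = np.array([0.025, -0.038, 0.080, 0.010, -0.002, 0.055])
x1    = np.array([0.035, -0.014, 0.031, 0.034, -0.009, 0.023])
n, k = len(y), 1                                   # observations, number of factors

sst = np.sum((y - y.mean()) ** 2)                  # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)              # regression sum of squares
sse = np.sum((y - y_hat) ** 2)                     # error sum of squares
msr = ssr / k                                      # mean square regression
mse = sse / (n - k - 1)                            # mean sum of square errors
ssx1  = np.sum((x1 - x1.mean()) ** 2)              # sum of squared X
ssx1y = np.sum((x1 - x1.mean()) * (y - y.mean()))  # cross-product sum of X and Y
print(sst, ssr, sse, msr, mse, ssx1, ssx1y)
```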

LINEAR REGRESSION
There are two forms of linear regression models: simple linear regression
and multiple linear regression models. In a scenario with only a single in-
dependent predictor variable the regression analysis is a simple linear
regression model. In a scenario with more than one independent variable
the regression model analysis is a multiple linear regression model.
These are described as follows:

True Linear Regression Model


The true linear relationship model has the form:
y = b0 + b1 x1 + … + bk xk + ε

Here we have
y = actual dependent value
xk = kth explanatory factor
b0 = actual constant term
bk = actual sensitivity of y to factor xk
ε = random market noise
In practice and in industry we are not provided with the true linear regres-
sion model, explanatory factors, or parameter values. We as analysts need to
determine a significant set of explanatory factors and estimate the parameter
values via statistical estimation. These statistical techniques are explained
below.

Simple Linear Regression Model


A simple linear model has the form:
y = b0 + b1 x1 + e

The simple linear regression model, i.e., the estimation regression model,
has the form:
ŷ = b0 + b1 x1

Here we have
y = actual dependent value
ŷ = estimated dependent variable
x1 = explanatory factor
b0 = intercept term
b1 = sensitivity of y to factor x1
e = regression error term
The regression error term, e, is the difference between the actual y value and
the estimated ŷ value. It also signifies the quantity of y that is not explained
by the explanatory factors. The regression error is calculated as follows:

e = y − ŷ

Solving the Simple Linear Regression Model


The goal of regression analysis is to calculate the best fit regression equation
so that the model can be used for analysis and forecasting needs. Solving
the simple linear regression model is a three-step process consisting of:
Step 1: Estimate Model Parameters.
Step 2: Evaluate model performance statistics.
Step 3: Test for statistical significance of factors.

Step 1: Estimate Model Parameters


The linear regression model parameters are estimated using the ordinary
least squares (OLS) technique. This process is as follows:
1. Define a loss function L to be the sum of the squared error for all obser-
vations as follows:
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

2. Substitute the actual regression equation for ŷ as follows:

L = \sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_{1i})\right)^2

This can be rewritten as follows:


L = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i})^2

3. Estimate model parameters via finding first-order conditions for all


parameters:
\frac{\partial L}{\partial b_0} = 2 \sum (y_i - b_0 - b_1 x_{1i})(-1) = 0

\frac{\partial L}{\partial b_1} = 2 \sum (y_i - b_0 - b_1 x_{1i})(-x_{1i}) = 0
4. Simplify the equations and bring the constant term to the RHS. This re-
sults in a system of linear equations:
b_0 \sum 1 + b_1 \sum x_{1i} = \sum y_i

b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 = \sum x_{1i} y_i

5. Calculate the reduced matrix form of the set of linear equations. This is
used to simplify the mathematics required to solve for the model. The
reduced matrix form of the simple linear regression model is:

\begin{bmatrix} n & \sum x_{1i} \\ \sum x_{1i} & \sum x_{1i}^2 \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_{1i} y_i \end{bmatrix}

Notice that the upper left value is n because \sum 1 = n.
6. Solve for model parameters b0 and b1.
Here we have two equations and two unknowns. The parameters can be
solved via many different techniques such as substitution, row reduction,
Gaussian elimination, Cramer’s rule, as well as matrix multiplication
techniques.
The solution is:
b_0 = \bar{y} - b_1 \bar{x}_1

b_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}
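A minimal sketch of these closed-form estimates in Python (the data arrays are hypothetical):

```python
import numpy as np

# Hypothetical observations of the dependent and explanatory variables
y  = np.array([0.02, -0.04, 0.09, 0.02, -0.01, 0.06, 0.03, -0.08])
x1 = np.array([0.03, -0.01, 0.03, 0.03, -0.01, 0.02, 0.03, -0.06])

# Closed-form OLS estimates for y = b0 + b1*x1 + e
b1 = np.sum((x1 - x1.mean()) * (y - y.mean())) / np.sum((x1 - x1.mean()) ** 2)
b0 = y.mean() - b1 * x1.mean()

y_hat = b0 + b1 * x1   # fitted values
e = y - y_hat          # regression errors
print(b0, b1)
```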

Step 2: Evaluate Model Performance Statistics


The next step is to compute the model performance statistics. This consists
of computing the R2 goodness of fit and the Syx standard error of the regres-
sion model. These are computed as follows:

Standard Error of the Regression Model


S_{yx} = \sqrt{\frac{SSE}{n - 2}}

R2 Goodness of Fit
R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}

Step 3: Test for Statistical Significance of Factors


We perform a hypothesis test to determine if the factors are statistically sig-
nificant and if they should be included in the regression model. This step
consists of calculating the T-stat and F-stat. These are calculated as follows:

T-test: Hypothesis Test:


TStat(b_1) = \frac{b_1}{Se(b_1)}

where

Se(b_1) = \frac{S_{yx}}{\sqrt{SSX_1}}

F-test: Hypothesis Test:


FStat(b_1) = \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)}
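Continuing the hypothetical sketch from Step 1, the evaluation statistics and the T-test and F-test can be computed as follows (y, x1, b1, and e are carried over from that snippet, and scipy is assumed to be available for the p-value lookup):

```python
import numpy as np
from scipy import stats   # used only for the p-value lookup

n = len(y)
sse = np.sum(e ** 2)
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse
ssx1 = np.sum((x1 - x1.mean()) ** 2)

syx = np.sqrt(sse / (n - 2))            # standard error of the regression
r2 = 1 - sse / sst                      # R-squared goodness of fit
se_b1 = syx / np.sqrt(ssx1)             # standard error of b1
t_b1 = b1 / se_b1                       # T-stat for b1
f_stat = (ssr / 1) / (sse / (n - 2))    # F-stat; equals t_b1**2 with one factor
p_value = 2 * (1 - stats.t.cdf(abs(t_b1), df=n - 2))
print(r2, syx, t_b1, f_stat, p_value)
```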

For a simple linear regression model, it is redundant to perform both a T-test


and an F-test of the data. If we find that the x variable is statistically signif-
icant from the T-test we will always reach the same conclusion using the F-
test and vice versa. Thus for a simple linear regression analysis, we will often
only perform a T-test. For a multiple linear regression analysis we need to
perform both T-test and F-test analyses.

Example: Simple Linear Regression


An analyst is asked to calculate the following simple linear regression to es-
timate price returns y from a variable x1:
ŷ = b0 + b1 x1

The underlying data for this analysis are shown in Table 6.1 and the OLS
regression results are shown in Table 6.2.

Table 6.1 Linear Regression Data.


Month Y X1 X2 X3

1 0.0206 0.0350 0.0156 0.0502


2 0.0429 0.0140 0.0011 0.0120
3 0.0871 0.0311 0.0151 0.0555
4 0.0159 0.0342 0.0115 0.0398
5 0.0080 0.0092 0.0306 0.0327
6 0.0605 0.0232 0.0098 0.0200
7 0.0349 0.0279 0.0331 0.0184
8 0.0790 0.0622 0.0060 0.0972
9 0.0048 0.0247 0.0271 0.0468
10 0.0308 0.0082 0.0182 0.0524
11 0.0108 0.0129 0.0046 0.0242
12 0.0516 0.0053 0.0142 0.0439
13 0.0088 0.0145 0.0160 0.1526
14 0.0248 0.0072 0.0180 0.0660
15 0.0290 0.0373 0.0010 0.0162
16 0.0720 0.0358 0.0001 0.0607
17 0.0160 0.0276 0.0196 0.0093
18 0.1161 0.0655 0.0088 0.0857
19 0.0163 0.0182 0.0003 0.0845
20 0.0201 0.0205 0.0056 0.0551
21 0.0328 0.0179 0.0323 0.0746
22 0.0146 0.0018 0.0096 0.0908
23 0.0162 0.0204 0.0374 0.0288
24 0.0404 0.0012 0.0084 0.0847
25 0.0434 0.0362 0.0166 0.0595
26 0.0633 0.0537 0.0572 0.0125
27 0.0215 0.0183 0.0305 0.0275
28 0.0299 0.0377 0.0007 0.0339
29 0.0563 0.0051 0.0008 0.0465
30 0.0723 0.0506 0.0449 0.0564
31 0.0259 0.0149 0.0058 0.0619
32 0.0062 0.0064 0.0044 0.0261
33 0.1164 0.0856 0.0239 0.0042
34 0.0318 0.0076 0.0454 0.0013
35 0.0242 0.0113 0.0309 0.0170
36 0.0544 0.0300 0.0150 0.0496

Table 6.2 Simple Linear Regression Output.


Regression Statistics

Multiple R 0.764982
R Square 0.585198
Adjusted R Square 0.572998
Standard Error (Syx) 0.026591
Observations 36
ANOVA
df SS MS F Significance F
Regression 1 0.033916 0.033916 47.966780 5.5522E-08
Residual 34 0.024040 0.000707
Total 35 0.057956
Coefficients Std Error T-Stat P-Value Lower 95% Upper 95%
Intercept 0.004574 0.005471 0.836003 0.40899 -0.006545 0.015693
X1 1.191801 0.172081 6.925805 5.5522E-08 0.842090 1.541512

This regression has R² = 0.585198, which is a very strong goodness of fit,
and has a regression error of Syx = 0.026591. The x1 variable is statistically
significant with T-stat = 6.925805, which is significant at p < 0.001.
Because the model has a strong goodness of fit and the variables
are significant, the model can be used to predict price returns.
The best fit regression prediction equation is:
ŷ = 0.004574 + 1.191801·x1

Therefore if we have x1 = 0.025, the expected y price change is:

ŷ = 0.004574 + 1.191801 · 0.025 = 0.034369

Multiple Linear Regression Model


A multiple linear model has the form:
y = b0 + b1 x1 + b2 x2 + … + bk xk + e

The multiple linear regression prediction model has the form:

ŷ = b0 + b1 x1 + b2 x2 + … + bk xk

Here
y = actual dependent value
ŷ = estimated dependent variable
xk = kth explanatory factor
b0 = intercept term
bk = sensitivity of y to factor xk
e = regression error term
The regression error term, e, is the difference between the actual y value and
the estimated ŷ value. It also signifies the quantity of y that is not explained
by the explanatory factors. The regression error is calculated as follows:

e = y − ŷ

Notice that this model is the same as the simple linear regression model but
with additional parameters and explanatory factors.

Solving the Multiple Linear Regression Model


The goal of regression analysis is to calculate the best fit regression equation
so that the model can be used for analysis and forecasting needs. Solving
the multiple linear regression model is a three-step process consisting of:
Step 1: Estimate model parameters.
Step 2: Calculate model performance statistics.
Step 3: Test for statistical significance of factors.

Step 1: Estimate Model Parameters


The linear regression model parameters are estimated using the OLS tech-
nique. This process is as follows:
1. Define a loss function L to be the sum of the squared error for all obser-
vations as follows:
L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

2. Substitute the actual regression equation for ŷ as follows:

L = \sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_{1i} + \dots + b_k x_{ki})\right)^2

This can be rewritten as follows:


L = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - \dots - b_k x_{ki})^2

3. Estimate model parameters via finding first-order conditions for all


parameters:
\frac{\partial L}{\partial b_0} = 2 \sum (y_i - b_0 - b_1 x_{1i} - \dots - b_k x_{ki})(-1) = 0

\frac{\partial L}{\partial b_1} = 2 \sum (y_i - b_0 - b_1 x_{1i} - \dots - b_k x_{ki})(-x_{1i}) = 0

\vdots

\frac{\partial L}{\partial b_k} = 2 \sum (y_i - b_0 - b_1 x_{1i} - \dots - b_k x_{ki})(-x_{ki}) = 0
4. Simplify the equations and bring the constant term to the RHS. This re-
sults in a system of linear equations:
b_0 \sum 1 + b_1 \sum x_{1i} + \dots + b_k \sum x_{ki} = \sum y_i

b_0 \sum x_{1i} + b_1 \sum x_{1i}^2 + \dots + b_k \sum x_{1i} x_{ki} = \sum x_{1i} y_i

\vdots

b_0 \sum x_{ki} + b_1 \sum x_{1i} x_{ki} + \dots + b_k \sum x_{ki}^2 = \sum x_{ki} y_i

5. Calculate the reduced matrix form of the set of linear equations. This is
used to simplify the mathematics required to solve for the model. The
reduced matrix form of the multiple linear regression model is:

\begin{bmatrix}
n & \sum x_{1i} & \cdots & \sum x_{ki} \\
\sum x_{1i} & \sum x_{1i}^2 & \cdots & \sum x_{1i} x_{ki} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_{ki} & \sum x_{1i} x_{ki} & \cdots & \sum x_{ki}^2
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \vdots \\ \sum x_{ki} y_i \end{bmatrix}

6. Solve for model parameters.


In the reduced matrix form we have k + 1 equations and k + 1 unknowns (the
intercept plus the k factor sensitivities). To solve for these parameter values,
it is required that the x variables be independent. Otherwise, it is not possible
to solve this system of equations. In mathematicians' speak, the requirement
is that the matrix has full rank.

These parameter values can be solved via numerous different techniques


such as substitution, row reduction, Gaussian elimination, Cramer’s rule,
as well as matrix multiplication techniques.
The solution for a two-variable, three-parameter model is:

b_0 = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2

b_1 = \frac{\sum (x_{1i} - \bar{x}_1)(y_i - \bar{y}) \sum (x_{2i} - \bar{x}_2)^2 - \sum (x_{2i} - \bar{x}_2)(y_i - \bar{y}) \sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum (x_{1i} - \bar{x}_1)^2 \sum (x_{2i} - \bar{x}_2)^2 - \left[\sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)\right]^2}

b_2 = \frac{\sum (x_{2i} - \bar{x}_2)(y_i - \bar{y}) \sum (x_{1i} - \bar{x}_1)^2 - \sum (x_{1i} - \bar{x}_1)(y_i - \bar{y}) \sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{\sum (x_{1i} - \bar{x}_1)^2 \sum (x_{2i} - \bar{x}_2)^2 - \left[\sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)\right]^2}
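A minimal sketch of this two-variable closed-form solution, written in terms of deviations from the means (the data arrays are hypothetical):

```python
import numpy as np

# Hypothetical observations for y and two explanatory factors
y  = np.array([0.02, -0.04, 0.09, 0.02, -0.01, 0.06, 0.03, -0.08])
x1 = np.array([0.03, -0.01, 0.03, 0.03, -0.01, 0.02, 0.03, -0.06])
x2 = np.array([0.02, 0.00, -0.02, 0.01, -0.03, 0.01, -0.03, 0.01])

# Deviations from the means
yd, x1d, x2d = y - y.mean(), x1 - x1.mean(), x2 - x2.mean()

# Cross-product sums used in the closed-form solution
s11, s22, s12 = np.sum(x1d ** 2), np.sum(x2d ** 2), np.sum(x1d * x2d)
s1y, s2y = np.sum(x1d * yd), np.sum(x2d * yd)

den = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / den
b2 = (s2y * s11 - s1y * s12) / den
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
print(b0, b1, b2)
```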

Step 2: Calculate Model Performance Statistics


The next step is to compute the model performance statistics. This consists
of computing the R2 goodness of fit and the Syx standard error of the regres-
sion model. These are computed as follows:

Standard Error of the Regression Model


S_{yx} = \sqrt{\frac{SSE}{n - k - 1}}

R2 Goodness of Fit
R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}

Step 3: Test for Statistical Significance of Factors


We perform a hypothesis test to determine if the factors are statistically sig-
nificant and if they should be included in the regression model. This step
consists of calculating the T-stat and F-stat. These are calculated as follows:

T-test: Hypothesis Test:


TStat(b_1) = \frac{b_1}{Se(b_1)}

where

Se(b_1) = \sqrt{\frac{\sum (x_{2i} - \bar{x}_2)^2}{\sum (x_{1i} - \bar{x}_1)^2 \sum (x_{2i} - \bar{x}_2)^2 - \left[\sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)\right]^2}} \cdot S_{yx}

And

TStat(b_2) = \frac{b_2}{Se(b_2)}

where

Se(b_2) = \sqrt{\frac{\sum (x_{1i} - \bar{x}_1)^2}{\sum (x_{1i} - \bar{x}_1)^2 \sum (x_{2i} - \bar{x}_2)^2 - \left[\sum (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)\right]^2}} \cdot S_{yx}

F-test: Hypothesis Test:


FStat = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}

For a simple linear regression model, it is redundant to perform both a T-test


and an F-test of the data. However, it is required to perform both a T-test and
an F-test with a multiple linear regression. A regression can only be stated to
be statistically significant if both T-test and F-test are accepted at the desired
significance level. A 5% significance level, i.e., a 95% confidence level, is the
most common choice, but analysts need to define these levels
based on their needs.

Example: Multiple Linear Regression


An analyst is asked to revisit the previous simple linear regression example
above, and include two additional explanatory variables x2 and x3. These
data are shown in Table 6.1. This multiple regression model has the form:
ŷ = b0 + b1 x1 + b2 x2 + b3 x3

The regression results are shown in Table 6.3.


At first glance, this regression appears to be an improvement over the sim-
ple linear regression with only one input variable. For example, this model
has a higher R² = 0.674 compared to R² = 0.585, and a smaller regression
error Syx = 0.0243 compared to Syx = 0.0266. The model has a significant
F-stat, but the x3 variable is not significant at α = 0.05 because
|T-stat| < 2 and the p-value > 0.05. Therefore variable x3 is not a significant
predictor of y.
It is important to note that if a variable is not found to be significant, then
analysts need to eliminate that variable from the regression data and rerun
the regression analysis. It is not correct to use the results from a regression
analysis where a variable is found to be insignificant.

Table 6.3 Multiple Linear Regression Output With Three Variables (Variable X3 is Insignificant).
Regression Statistics

Multiple R 0.820723
R Square 0.673587
Adjusted R Square 0.642986
Standard Error (Syx) 0.024314
Observations 36
ANOVA
df SS MS F Significance F
Regression 3 0.039039 0.013013 22.011754 6.383118E-08
Residual 32 0.018918 0.000591
Total 35 0.057956
Coefficients Std Error T-Stat P-Value Lower 95% Upper 95%
Intercept 0.006153 0.005721 1.075533 0.29017893 -0.005500 0.017807
X1 1.290985 0.180894 7.136683 4.2360216E-08 0.922516 1.659455
X2 0.470088 0.196004 2.398355 0.0224714 0.070840 0.869336
X3 0.135790 0.081979 1.656401 0.10741556 -0.031196 0.302776

Therefore we must rerun this regression analysis using only x1, x2. This
model is:
ŷ = b0 + b1 x1 + b2 x2

The results from this regression are shown in Table 6.4.


This regression is also an improvement over the simple linear
regression with only one input variable. This model has a higher
R² = 0.6456 compared to R² = 0.585, and a smaller regression error
Syx = 0.0249 compared to Syx = 0.0266. The model has a significant F-
stat and significant T-stat for all variables. Therefore this is an acceptable
regression model.
The best fit regression prediction equation is:
ŷ = 0.009382 + 1.146426·x1 + 0.476859·x2

Therefore if we have x1 = 0.025 and x2 = 0.005, the expected y price
change is:

ŷ = 0.009382 + 1.146426 · 0.025 + 0.476859 · 0.005 = 0.040427

Table 6.4 Multiple Linear Regression Output With Two Significant Variables (All Variables are Significant).
Regression Statistics

Multiple R 0.803493
R Square 0.645600
Adjusted R Square 0.624122
Standard Error (Syx) 0.024948
Observations 36
ANOVA
df SS MS F Significance F
Regression 2 0.037417 0.018708 30.057608 3.68671E-08
Residual 33 0.020540 0.000622
Total 35 0.057956
Coefficients Std Error T-Stat P-Value Lower 95% Upper 95%
Intercept 0.009382 0.005519 1.699894 0.09855929 -0.001847 0.020611
X1 1.146426 0.162581 7.051400 4.5367E-08 0.815652 1.477200
X2 0.476859 0.201072 2.371581 0.0237011 0.067775 0.885944

MATRIX TECHNIQUES
In matrix notation, the true regression model is written as:
y = Xb + ε

The estimation regression model is:


ŷ = Xb̂

The vector of error terms (also known as vector of residuals) is then:


e = y − Xb̂

Estimate Parameters
The parameters of our regression model are estimated via OLS as follows:
Step 1. Compute the residual sum of squares:

e^T e = (y - X\hat{b})^T (y - X\hat{b})

Step 2. Estimate the parameters b̂ by differentiating and solving the
first-order condition, which yields:

\hat{b} = (X^T X)^{-1} X^T y
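A compact sketch of this matrix estimator, where the design matrix carries a leading column of ones for the intercept (the data values are hypothetical):

```python
import numpy as np

# Hypothetical data: n observations, two explanatory factors plus an intercept column
x1 = np.array([0.03, -0.01, 0.03, 0.03, -0.01, 0.02, 0.03, -0.06])
x2 = np.array([0.02, 0.00, -0.02, 0.01, -0.03, 0.01, -0.03, 0.01])
y  = np.array([0.02, -0.04, 0.09, 0.02, -0.01, 0.06, 0.03, -0.08])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix [1, x1, x2]

# OLS estimate b_hat = (X'X)^(-1) X'y; solve() is preferred to an explicit inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ b_hat   # fitted values
e = y - y_hat       # residual vector
print(b_hat)
```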

Compute Standard Errors of b̂

This is calculated by computing the covariance matrix of b̂. We follow the
approach from Greene (2000) and Mittelhammer, Judge, and Miller (2000).
This is as follows:
Step 1. Start with the estimated b̂ from above and substitute for y:

\hat{b} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (Xb + \varepsilon) = (X^T X)^{-1} X^T X b + (X^T X)^{-1} X^T \varepsilon = Ib + (X^T X)^{-1} X^T \varepsilon = b + (X^T X)^{-1} X^T \varepsilon

Therefore our estimated parameters are:


\hat{b} = b + (X^T X)^{-1} X^T \varepsilon

Step 2. Compute the expected value of b̂ as follows:

E(\hat{b}) = E\left[b + (X^T X)^{-1} X^T \varepsilon\right] = E(b) + E\left[(X^T X)^{-1} X^T \varepsilon\right] = E(b) + (X^T X)^{-1} X^T E(\varepsilon) = b + (X^T X)^{-1} X^T \cdot 0 = b

Therefore we have:

E(\hat{b}) = b

which states that b̂ is an unbiased estimate of b.


Step 3. Compute the covariance matrix of b̂ as follows:

Cov(\hat{b}) = E\left[(\hat{b} - b)(\hat{b} - b)^T\right] = E\left[\left((X^T X)^{-1} X^T \varepsilon\right)\left((X^T X)^{-1} X^T \varepsilon\right)^T\right]
= E\left[(X^T X)^{-1} X^T \varepsilon \varepsilon^T X (X^T X)^{-1}\right]
= (X^T X)^{-1} X^T E(\varepsilon \varepsilon^T) X (X^T X)^{-1}
= (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 I (X^T X)^{-1}
= \sigma^2 (X^T X)^{-1}

It is important to note that if E(εεᵀ) ≠ σ²·I, then the errors are heteroskedastic,
i.e., they do not have constant variance, which violates one of our required
regression assumptions.
The standard error of the parameters is computed from the above matrix:
Se(\hat{b}) = \sqrt{\operatorname{diag}\left(\sigma^2 (X^T X)^{-1}\right)}
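Continuing the matrix sketch above, the covariance matrix and parameter standard errors follow directly; note that σ² is replaced by the usual sample estimate e'e/(n − k − 1), which is an assumption since the text does not state the estimator explicitly:

```python
# Continuing from the matrix OLS sketch: X, y, b_hat, and e are assumed to be defined
n, p = X.shape                        # p = k + 1 parameters, including the intercept
s2 = (e @ e) / (n - p)                # sample estimate of the error variance sigma^2
cov_b = s2 * np.linalg.inv(X.T @ X)   # covariance matrix of b_hat
se_b = np.sqrt(np.diag(cov_b))        # standard errors of the parameters
t_stats = b_hat / se_b                # T-stats for each parameter
print(se_b, t_stats)
```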



R² Statistic

R^2 = \frac{\hat{b}^T X^T y - n \bar{y}^2}{y^T y - n \bar{y}^2}

The coefficient of determination will be between 0 and 1. The closer the
value is to one, the better the fit of the model.

F-Statistic

F = \frac{\left(\hat{b}^T X^T y - n \bar{y}^2\right) / (k - 1)}{\left(y^T y - \hat{b}^T X^T y\right) / (n - k)}

where k here denotes the total number of estimated parameters, including the intercept.
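Both statistics can be computed from the same matrix quantities; a short sketch continuing the earlier snippet (here k is the total number of estimated parameters, including the intercept):

```python
# Continuing from the matrix OLS sketch: X, y, and b_hat are assumed to be defined
n, k = X.shape                               # k counts all estimated parameters
ybar = y.mean()

explained = b_hat @ X.T @ y - n * ybar ** 2  # b'X'y - n*ybar^2
total     = y @ y - n * ybar ** 2            # y'y  - n*ybar^2
residual  = y @ y - b_hat @ X.T @ y          # y'y  - b'X'y

r_squared = explained / total
f_stat = (explained / (k - 1)) / (residual / (n - k))
print(r_squared, f_stat)
```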

LOG REGRESSION MODEL


A log-regression model is a regression equation where one or more of the
variables are linearized via a log transformation. Once linearized, the regres-
sion parameters can be estimated following the OLS techniques above. It
allows us to transform a complex nonlinear relationship into a simpler linear
model that can be easily evaluated using direct and standard techniques.
Log-regression models can be grouped into three categories: (1) linear-log
model where we transform the x explanatory variables using logs, (2) log-
linear model where we transform the y-dependent variable using logs, and
(3) log-log model where both the y-dependent variable and the x explanatory
factors are both transformed using logs.
For example, if Y and X refer to the actual data observations, then our four
categories of log transformations are:
1. Linear: Y = b0 + b1·X + e
2. Linear-log: Y = b0 + b1·log(X) + e
3. Log-linear: log(Y) = b0 + b1·X + e
4. Log-log: log(Y) = b0 + b1·log(X) + e
As stated, the parameters of these models can be estimated directly from our
OLS technique provided above.
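For example, a log-log model can be estimated with exactly the same OLS machinery once the data are transformed; a minimal sketch with hypothetical, strictly positive observations:

```python
import numpy as np

# Hypothetical strictly positive observations
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([1.2, 1.9, 3.1, 4.8, 7.9, 12.5])

# Log-log model: log(y) = b0 + b1*log(x) + e, estimated by ordinary OLS
X = np.column_stack([np.ones_like(x), np.log(x)])
b0, b1 = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(b0, b1)
```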

Example: Log-Transformation
Let the relationship between the dependent variable Y and independent vari-
ables X1 and X2 follow a power function as follows:

y = b_0 x_1^{b_1} x_2^{b_2} \varepsilon

where ln(ε) ~ N(0, σ²).

It would be very difficult to estimate and solve for the model parameters b0,
b1, b2 via OLS and find the first-order conditions. This is because there is a
nonlinear relationship between the dependent variable y and the explanatory
variables x and parameters b0, b1, b2. However, it is possible to simplify this
model into a linearized form by taking a log transformation of the data as
follows:
Step 1: Take logs of both sides (natural logarithms):
 
ln(y) = ln\left(b_0 x_1^{b_1} x_2^{b_2} \varepsilon\right)

Step 2: Simplify the RHS:


ln(y) = ln(b0) + b1 ln(x1) + b2 ln(x2) + ln(ε)

Step 3: Rewrite the equation using new parameters a0, a1, a2:

ln(y) = a0 + a1 ln(x1) + a2 ln(x2) + e

Step 4: Run an OLS regression on the transformed equation to solve for a0, a1, a2.

This model can now be solved using the techniques above. We convert from
a0, a1, a2 to b0, b1, b2 as follows:

b_0 = \exp\left(a_0 + \tfrac{1}{2} S_{yx}^2\right)

b_1 = a_1

b_2 = a_2

It is important to note here that the constant parameter b0 requires an adjust-


ment using the regression variance term, i.e., the squared regression error.
This adjustment is a required step when back-transforming from a log-normal
distribution.
For example, if y has a log-normal distribution, that is, ln(y) ~ N(u, v²) with
mean u and variance v², then the expected value of y is calculated as follows:

E(y) = e^{u + \frac{1}{2} v^2}

Example: Log-Linear Transformation


An analyst is using a power function model to estimate market impact cost.
This model is as follows:

y = b_0 x_1^{b_1} x_2^{b_2} \varepsilon

where y is the dependent variable, x1 and x2 are the independent explanatory


factors, b0, b1, b2 are the model parameters, and ε is the error term with
ln(ε) ~ N(0, σ²). These model parameters can be estimated using a log
transformation of the data as follows:

ln(y) = ln(b0) + b1 ln(x1) + b2 ln(x2) + ln(ε)

Then, we can apply OLS techniques to the following model using adjusted
parameter variables to make the process easier to follow. This is:
ln(y) = a0 + a1 ln(x1) + a2 ln(x2) + e

where e ~ N(0, σ²).
If the results of the log-transformed regression analysis are a0 = 6.25,
a1 = 0.52, a2 = 0.76, and Syx = 0.21, then the power function regression
parameters are calculated as follows:

b0 = exp(6.25 + 0.5 · 0.21²) = 529.56, b1 = a1 = 0.52, b2 = a2 = 0.76, and Syx = 0.25.

Therefore the power function best fit prediction equation is:

y = 529.56 · x1^0.52 · x2^0.76
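A quick numeric check of the intercept back-transformation, followed by a hypothetical prediction with the fitted power function (the x1 and x2 input values are assumptions used purely for illustration):

```python
import math

# Back-transform the intercept using the figures quoted above
a0, syx = 6.25, 0.21
b0 = math.exp(a0 + 0.5 * syx ** 2)
print(round(b0, 2))   # approximately 529.56

# Hypothetical usage: predict y for assumed inputs x1 and x2
x1, x2 = 0.02, 0.10
y_hat = b0 * x1 ** 0.52 * x2 ** 0.76
print(y_hat)
```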

POLYNOMIAL REGRESSION MODEL


A polynomial regression model is a model where the dependent variable is a
function of a single independent variable x. A polynomial regression model
has the form:

ŷ = b0 + b1 x + b2 x² + b3 x³ + … + bh x^h

In this model, the input variable x can be any value (both positive and nega-
tive) but the exponents of x must be positive integers.
A polynomial model has many applications in trading and finance. For
example, a polynomial function of degree h ¼ 2 is known as a quadratic
model and is used for portfolio optimization. A higher degree polynomial
such as h ¼ 3 is known as a cubic model and is used in finance to model
and optimize complex portfolios; it is also often used as a loss function in

place of a linear constraint. A polynomial can also be used to approximate


more complex mathematical functions such as those used in algorithmic
trading and advanced portfolio optimization.
In a polynomial regression model there is a single x variable and multiple
explanatory factors that are functions of this x variable. For example, the
explanatory factors of this hth degree polynomial are (x, x2, x3, ., xh).
Above we stated that a regression analysis requires the explanatory factors
to be independent of one another. In the case of this polynomial, there is mul-
ticollinearity across all the factors. That is, the set of factors are all correlated.
However, because these factors are not perfectly correlated we are able to
solve for model parameters. In a situation where there is multicollinearity
embedded in the x variables, we might still be able to estimate parameters
and the best fit prediction model. The only issues that we cannot determine
are the true sensitivity and cause and effect between the dependent y variable
and x variables. For many of our needs, however, we may only need the pa-
rameters for prediction purposes.
It is important to note that the polynomial regression model is a linear
model because the dependent variable y is defined as a linear function of
the parameters. This allows us to estimate the model parameters using
OLS techniques.
For example, the following polynomial regression model:

ŷ = b0 + b1 x + b2 x² + b3 x³

has a reduced matrix form determined from the first-order conditions. This
reduced matrix is:

\begin{bmatrix}
n & \sum x & \sum x^2 & \sum x^3 \\
\sum x & \sum x^2 & \sum x^3 & \sum x^4 \\
\sum x^2 & \sum x^3 & \sum x^4 & \sum x^5 \\
\sum x^3 & \sum x^4 & \sum x^5 & \sum x^6
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix} \sum y \\ \sum xy \\ \sum x^2 y \\ \sum x^3 y \end{bmatrix}

The parameter values b0, b1, …, bh can be solved via matrix algebra or via
Cramer’s rule.
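Because the model is linear in its parameters, the same OLS machinery applies directly; a brief sketch for the cubic case with hypothetical data:

```python
import numpy as np

# Hypothetical single-factor data
x = np.array([0.1, 0.2, 0.35, 0.5, 0.7, 0.9, 1.1])
y = np.array([0.15, 0.22, 0.40, 0.55, 0.80, 1.10, 1.45])

# Design matrix for y_hat = b0 + b1*x + b2*x^2 + b3*x^3
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations above
print(b)
```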

FRACTIONAL REGRESSION MODEL


A fractional regression model, also referred to as a fractional polynomial
regression model, is a model where the dependent variable is a function
of a single independent variable x. The value of x in fractional regression

models must be positive (e.g., x > 0) and the exponents can be any value
(positive or negative, and integer or fraction).
An example of a fractional regression model with both integer and frac-
tional exponents is:

ŷ = b0 + b1 x + b2 x² + b3 x^(1/2)

A more complex fractional regression model with additional fractional
exponent terms is:

ŷ = b0 + b1 x + b2 x² + b3 x^(1/2) + b4 x^(3/2) + b5 x^(5/3)

These fractional polynomial regression models have become increasingly


popular in finance and especially in algorithmic trading. These models
are used to estimate market impact cost and have been embedded in algo-
rithmic trading engines. These models are also being used as approxima-
tions to more advanced and complex equations. In many situations, these
fractional regression models provide an exceptional level of accuracy and
they can be solved much faster than the more complex models,
thus proving very beneficial for algorithmic trading, optimization, and
many machine learning algorithms.
Like the polynomial regression model, the fractional polynomial model is a
linear model because the dependent variable y is defined as a linear function
of the parameters. This allows us to estimate the model parameters using
OLS techniques.
For example, the following fractional regression model:

ŷ = b0 + b1 x + b2 x² + b3 x^(1/2)

has a reduced matrix form determined from the first-order conditions. This
reduced matrix is:

\begin{bmatrix}
n & \sum x & \sum x^2 & \sum x^{1/2} \\
\sum x & \sum x^2 & \sum x^3 & \sum x^{3/2} \\
\sum x^2 & \sum x^3 & \sum x^4 & \sum x^{5/2} \\
\sum x^{1/2} & \sum x^{3/2} & \sum x^{5/2} & \sum x
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
=
\begin{bmatrix} \sum y \\ \sum xy \\ \sum x^2 y \\ \sum x^{1/2} y \end{bmatrix}

The parameter values b0, b1, …, bh can be solved via matrix algebra or via
Cramer’s rule.
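The same OLS machinery applies here as well, provided x is strictly positive; a minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical single-factor data; x must be strictly positive for fractional powers
x = np.array([0.05, 0.10, 0.20, 0.35, 0.50, 0.75, 1.00])
y = np.array([0.40, 0.55, 0.70, 0.90, 1.05, 1.30, 1.50])

# Design matrix for y_hat = b0 + b1*x + b2*x^2 + b3*x^(1/2)
X = np.column_stack([np.ones_like(x), x, x ** 2, np.sqrt(x)])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)
```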
