Chapter 6
Linear Regression Models
INTRODUCTION
Regression analysis is a statistical technique used to model the relationship between a dependent variable (known as the output variable, response variable, or simply the y variable) and a set of independent variables (known as the explanatory factors, predictor variables, or simply the x variables). The dependent y variable is also referred to as the left-hand side (LHS) variable and the x variables as the right-hand side (RHS) of the equation. The goal of performing regression analysis is to uncover a set of statistically significant x variables and the sensitivity of the y variable to these input variables. We are interested in learning how the y variable will change given a change in the x variables. These sensitivities are known as the model betas and are denoted as b. After a statistically significant relationship is uncovered, analysts can forecast future outcome events.
Regression analysis is used in finance for many different purposes, including asset pricing models such as the capital asset pricing model (CAPM) and arbitrage pricing theory (APT), price prediction and scenario analysis, risk modeling, volatility forecasting, and Monte Carlo simulation. More recently, regression models have made their way into the algorithmic trading arena, where they are used for transaction cost analysis, market impact estimation, and portfolio optimization.
The usage of regression analysis in trading and finance serves four main
purposes:
1. Determining a statistically significant relationship between the y variable
and x variable(s).
2. Estimating model parameters, b.
3. Forecasting future y values.
4. Performing what-if and scenario analysis to understand how the y values
will change given different sets of x input variables.
A4. Constant variance: each error term has the same variance, e.g., no heteroskedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$.
Regression Metrics
In performing regression analysis and evaluating the model, we need the following set of statistical metrics and calculations:

$b_k$ = model parameter values: the estimated sensitivity of y to factor k
$e$ = regression error: determined from the estimation process
$Se(b_k)$ = standard error of the estimated parameter $b_k$
$S_{yx}$ = standard error of the regression model using the set of explanatory factors
Regression sum of squares: the sum of the squared differences between the predicted $\hat{y}$ value and the average $y$ value:

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
Error sum of squares: the sum of the squared differences between the predicted $\hat{y}$ value and the actual $y$ value:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Mean square regression: the regression sum of squares divided by the number of factors $k$:

$$MSR = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{k} = \frac{SSR}{k}$$
Mean square error: the error sum of squares divided by the degrees of freedom:

$$MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1} = \frac{SSE}{n - k - 1}$$
Sum of squared X: the sum of the squared differences between the actual $x_k$ value and its average value. For a simple linear regression model there is only one x variable:

$$SSX_k = \sum_{i=1}^{n} (x_{ki} - \bar{x}_k)^2$$
Sum of squared Y: the sum of the squared differences between the actual $y$ value and the average $y$ value:

$$SSY = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Sum of squared XY: the sum of the products of the deviations of x and y from their means:

$$SSX_kY = \sum_{i=1}^{n} (x_{ki} - \bar{x}_k)(y_i - \bar{y})$$
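These metrics are straightforward to compute directly. Below is a minimal sketch in Python (the function and variable names are illustrative, not from the text) that computes them from the actual values, the model's predicted values, and the number of factors k:

```python
import numpy as np

def regression_metrics(y, y_hat, k):
    """Compute SSR, SSE, SST, MSR, and MSE for a fitted model with k factors."""
    n = len(y)
    y_bar = np.mean(y)
    ssr = np.sum((y_hat - y_bar) ** 2)   # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)       # error sum of squares
    sst = np.sum((y - y_bar) ** 2)       # total sum of squares (SSY)
    msr = ssr / k                        # mean square regression
    mse = sse / (n - k - 1)              # mean square error, df = n - k - 1
    return ssr, sse, sst, msr, mse
```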
LINEAR REGRESSION
There are two forms of linear regression models: simple linear regression and multiple linear regression models. In a scenario with only a single independent predictor variable, the regression analysis is a simple linear regression model. In a scenario with more than one independent variable, the regression analysis is a multiple linear regression model. These are described as follows.

The true linear regression model (with one or more explanatory factors) has the form:

$$y = b_0 + b_1 x_1 + \dots + b_k x_k + \varepsilon$$

Here we have

$y$ = actual dependent value
$x_k$ = kth explanatory factor
$b_0$ = actual constant term
$b_k$ = actual sensitivity of y to factor $x_k$
$\varepsilon$ = random market noise
In practice and in industry we are not provided with the true linear regression model, explanatory factors, or parameter values. We as analysts need to determine a significant set of explanatory factors and estimate the parameter values via statistical estimation. These statistical techniques are explained below.
The simple linear regression model, i.e., the estimation regression model, has the form:

$$\hat{y} = b_0 + b_1 x_1$$

Here we have

$y$ = actual dependent value
$\hat{y}$ = estimated dependent variable
$x_1$ = explanatory factor
$b_0$ = intercept term
$b_1$ = sensitivity of y to factor $x_1$
$e$ = regression error term
The regression error term, $e$, is the difference between the actual $y$ value and the estimated $\hat{y}$ value. It also signifies the quantity of $y$ that is not explained by the explanatory factors. The regression error is calculated as follows:

$$e = y - \hat{y}$$

The parameters are estimated via ordinary least squares (OLS):

1. Define the loss function as the sum of squared errors: $L = \sum (y - b_0 - b_1 x_1)^2$.
2. Take the partial derivative of $L$ with respect to each parameter.
3. Set each partial derivative equal to zero:

$$\frac{\partial L}{\partial b_0} = -2 \sum (y - b_0 - b_1 x_1) = 0$$

$$\frac{\partial L}{\partial b_1} = -2 \sum (y - b_0 - b_1 x_1)(x_1) = 0$$
4. Simplify the equations and bring the constant term to the RHS. This results in a system of linear equations:

$$b_0 \sum 1 + b_1 \sum x_1 = \sum y$$

$$b_0 \sum x_1 + b_1 \sum x_1^2 = \sum x_1 y$$
5. Calculate the reduced matrix form of the set of linear equations. This is used to simplify the mathematics required to solve for the model. The reduced matrix form of the simple linear regression model is:

$$\begin{bmatrix} n & \sum x_1 \\ \sum x_1 & \sum x_1^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum x_1 y \end{bmatrix}$$

Notice that the upper left value is $n$ because $\sum 1 = n$.
6. Solve for model parameters b0 and b1.
Here we have two equations and two unknowns. The parameters can be
solved via many different techniques such as substitution, row reduction,
Gaussian elimination, Cramer’s rule, as well as matrix multiplication
techniques.
The solution is:

$$b_0 = \bar{y} - b_1 \bar{x}_1$$

$$b_1 = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}$$
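As a quick illustration, the closed-form solution above can be computed in a few lines of Python (a hedged sketch; the function name is ours, not the book's):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form OLS estimates for the simple model y_hat = b0 + b1*x."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar                  # b0 = y_bar - b1 * x_bar
    return b0, b1
```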
R² Goodness of Fit

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$

The t-statistic for the slope parameter is $TStat(b_1) = b_1 / Se(b_1)$, where

$$Se(b_1) = \frac{S_{yx}}{\sqrt{SSX_1}}$$
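A companion sketch (same caveats: illustrative names, NumPy assumed) for the goodness-of-fit and significance calculations above:

```python
import numpy as np

def simple_ols_diagnostics(x, y, b0, b1):
    """R^2, regression standard error Syx, Se(b1), and the t-statistic."""
    n = len(y)
    y_hat = b0 + b1 * x
    sse = np.sum((y - y_hat) ** 2)           # error sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    r2 = 1.0 - sse / sst                     # R^2 = 1 - SSE/SST = SSR/SST
    syx = np.sqrt(sse / (n - 2))             # standard error of the regression
    ssx1 = np.sum((x - np.mean(x)) ** 2)     # SSX1
    se_b1 = syx / np.sqrt(ssx1)              # Se(b1) = Syx / sqrt(SSX1)
    t_b1 = b1 / se_b1                        # t-statistic for b1
    return r2, syx, se_b1, t_b1
```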
The underlying data for this analysis are shown in Table 6.1 and the OLS
regression results are shown in Table 6.2.
Table 6.2 Simple Linear Regression Output.

Regression Statistics
  Multiple R            0.764982
  R Square              0.585198
  Adjusted R Square     0.572998
  Standard Error (Syx)  0.026591
  Observations          36

ANOVA
              df    SS        MS        F           Significance F
  Regression   1    0.033916  0.033916  47.966780   5.5522E-08
  Residual    34    0.024040  0.000707
  Total       35    0.057956

              Coefficients  Std Error  T-Stat    P-Value     Lower 95%  Upper 95%
  Intercept   0.004574      0.005471   0.836003  0.40899     -0.006545  0.015693
  X1          1.191801      0.172081   6.925805  5.5522E-08  0.842090   1.541512
The multiple linear regression model, i.e., the estimation regression model with $k$ explanatory factors, has the form:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

Here

$y$ = actual dependent value
$\hat{y}$ = estimated dependent variable
$x_k$ = kth explanatory factor
$b_0$ = intercept term
$b_k$ = sensitivity of y to factor $x_k$
$e$ = regression error term

The regression error term, $e$, is the difference between the actual $y$ value and the estimated $\hat{y}$ value. It also signifies the quantity of $y$ that is not explained by the explanatory factors. The regression error is calculated as follows:

$$e = y - \hat{y}$$
Notice that this model is the same as the simple linear regression model but with additional parameters and explanatory factors. The parameters are again estimated via OLS: define the sum of squared errors $L = \sum (y - b_0 - b_1 x_1 - \dots - b_k x_k)^2$ and set the partial derivative of $L$ with respect to each parameter equal to zero:

$$\frac{\partial L}{\partial b_0} = -2 \sum (y - b_0 - b_1 x_1 - \dots - b_k x_k) = 0$$

$$\frac{\partial L}{\partial b_1} = -2 \sum (y - b_0 - b_1 x_1 - \dots - b_k x_k)(x_1) = 0$$

$$\vdots$$

$$\frac{\partial L}{\partial b_k} = -2 \sum (y - b_0 - b_1 x_1 - \dots - b_k x_k)(x_k) = 0$$
4. Simplify the equations and bring the constant term to the RHS. This results in a system of linear equations:

$$b_0 \sum 1 + b_1 \sum x_1 + \dots + b_k \sum x_k = \sum y$$

$$b_0 \sum x_1 + b_1 \sum x_1^2 + \dots + b_k \sum x_1 x_k = \sum x_1 y$$

$$\vdots$$

$$b_0 \sum x_k + b_1 \sum x_1 x_k + \dots + b_k \sum x_k^2 = \sum x_k y$$
5. Calculate the reduced matrix form of the set of linear equations. This is used to simplify the mathematics required to solve for the model. The reduced matrix form of the multiple linear regression model is:

$$\begin{bmatrix} n & \sum x_1 & \cdots & \sum x_k \\ \sum x_1 & \sum x_1^2 & \cdots & \sum x_1 x_k \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_k & \sum x_1 x_k & \cdots & \sum x_k^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum x_1 y \\ \vdots \\ \sum x_k y \end{bmatrix}$$

6. Solve for the model parameters $b_0, b_1, \dots, b_k$ via matrix algebra.
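In practice the reduced matrix need not be assembled term by term: if A is the design matrix with a leading column of ones, then A'A is exactly the matrix of sums above and A'y is the right-hand side. A minimal sketch (illustrative names, NumPy assumed):

```python
import numpy as np

def normal_equations_fit(X, y):
    """Solve the normal equations for multiple linear regression.
    X is an (n, k) matrix of explanatory factors without an intercept column."""
    A = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
    lhs = A.T @ A                              # the reduced matrix of sums above
    rhs = A.T @ y                              # [sum(y), sum(x1*y), ..., sum(xk*y)]
    return np.linalg.solve(lhs, rhs)           # [b0, b1, ..., bk]
```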
R² Goodness of Fit

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$

For the two-variable model, the t-statistic for $b_1$ is $TStat(b_1) = b_1 / Se(b_1)$, where

$$Se(b_1) = \sqrt{\frac{\sum x_{2i}^2}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2}} \cdot S_{yx}$$
And

$$TStat(b_2) = \frac{b_2}{Se(b_2)}$$

where

$$Se(b_2) = \sqrt{\frac{\sum x_{1i}^2}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2}} \cdot S_{yx}$$
Table 6.3 Multiple Linear Regression Output With Three Variables (Variable X3 is Insignificant).

Regression Statistics
  Multiple R            0.820723
  R Square              0.673587
  Adjusted R Square     0.642986
  Standard Error (Syx)  0.024314
  Observations          36

ANOVA
              df    SS        MS        F          Significance F
  Regression   3    0.039039  0.013013  22.011754  6.383118E-08
  Residual    32    0.018918  0.000591
  Total       35    0.057956

              Coefficients  Std Error  T-Stat    P-Value        Lower 95%  Upper 95%
  Intercept   0.006153      0.005721   1.075533  0.29017893     -0.005500  0.017807
  X1          1.290985      0.180894   7.136683  4.2360216E-08  0.922516   1.659455
  X2          0.470088      0.196004   2.398355  0.0224714      0.070840   0.869336
  X3          0.135790      0.081979   1.656401  0.10741556     -0.031196  0.302776
Therefore we must rerun this regression analysis using only $x_1$ and $x_2$. This model is:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$
Table 6.4 Multiple Linear Regression Output With Two Significant Variables (All Variables are Significant).

Regression Statistics
  Multiple R            0.803493
  R Square              0.645600
  Adjusted R Square     0.624122
  Standard Error (Syx)  0.024948
  Observations          36

ANOVA
              df    SS        MS        F          Significance F
  Regression   2    0.037417  0.018708  30.057608  3.68671E-08
  Residual    33    0.020540  0.000622
  Total       35    0.057956

              Coefficients  Std Error  T-Stat    P-Value     Lower 95%  Upper 95%
  Intercept   0.009382      0.005519   1.699894  0.09855929  -0.001847  0.020611
  X1          1.146426      0.162581   7.051400  4.5367E-08  0.815652   1.477200
  X2          0.476859      0.201072   2.371581  0.0237011   0.067775   0.885944
MATRIX TECHNIQUES

In matrix notation, the true regression model is written as:

$$y = Xb + \varepsilon$$

Estimate Parameters

The parameters of our regression model are estimated via OLS as follows:

Step 1. Compute the residual sum of squares:

$$e^T e = (y - Xb)^T (y - Xb)$$

Step 2. Minimize the residual sum of squares with respect to $b$. The first-order condition yields the OLS estimator:

$$b = (X^T X)^{-1} X^T y$$
Substituting the true model into the estimator and taking expectations, we have:

$$E(b) = b$$

that is, the OLS estimate is an unbiased estimate of the true parameter vector. It is important to note that if $E(ee^T) \neq \sigma^2 I$, then the data are heteroskedastic, i.e., the variance is not constant, which violates one of our required regression properties.

The covariance matrix of the parameter estimates is $s^2 (X^T X)^{-1}$, where $s^2 = e^T e / (n - k)$. The standard error of each parameter is the square root of the corresponding diagonal element of this matrix:

$$Se(b_k) = \sqrt{\left[s^2 (X^T X)^{-1}\right]_{kk}}$$
R² Statistic

$$R^2 = \frac{b' X' y - n \bar{y}^2}{y' y - n \bar{y}^2}$$
F-Statistic

$$F = \frac{\left(b' X' y - n \bar{y}^2\right) / (k - 1)}{\left(y' y - b' X' y\right) / (n - k)}$$
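Putting the matrix formulas together, a hedged sketch of the full OLS computation (our own function name; k counts all columns of X, including the intercept column of ones):

```python
import numpy as np

def matrix_ols(X, y):
    """OLS in matrix form with standard errors, R^2, and the F-statistic."""
    n, k = X.shape                             # X includes the column of ones
    b = np.linalg.solve(X.T @ X, X.T @ y)      # b = (X'X)^(-1) X'y
    e = y - X @ b                              # residual vector
    s2 = (e @ e) / (n - k)                     # estimated error variance
    se_b = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # Se of each parameter
    y_bar = np.mean(y)
    r2 = (b @ X.T @ y - n * y_bar**2) / (y @ y - n * y_bar**2)
    f = ((b @ X.T @ y - n * y_bar**2) / (k - 1)) / ((y @ y - b @ X.T @ y) / (n - k))
    return b, se_b, r2, f
```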
Example: Log-Transformation

Let the relationship between the dependent variable $y$ and independent variables $x_1$ and $x_2$ follow a power function as follows:

$$y = b_0 x_1^{b_1} x_2^{b_2} \varepsilon$$

where $\ln(\varepsilon) \sim N(0, \sigma^2)$.
It would be very difficult to estimate and solve for the model parameters b0,
b1, b2 via OLS and find the first-order conditions. This is because there is a
nonlinear relationship between the dependent variable y and the explanatory
variables x and parameters b0, b1, b2. However, it is possible to simplify this
model into a linearized form by taking a log transformation of the data as
follows:
Step 1: Take logs of both sides (natural logarithms):

$$\ln(y) = \ln\left(b_0 x_1^{b_1} x_2^{b_2} \varepsilon\right)$$
Step 2: Expand the right-hand side:

$$\ln(y) = \ln(b_0) + b_1 \ln(x_1) + b_2 \ln(x_2) + \ln(\varepsilon)$$

Step 3: Substitute the adjusted parameters $a_0 = \ln(b_0)$, $b_1 = a_1$, and $b_2 = a_2$. To later recover $b_0$ from $a_0$ we use the lognormal property that if a log variable is normally distributed with mean $u$ and variance $v^2$, then:

$$y = e^{u + \frac{1}{2} v^2}$$
Then, we can apply OLS techniques to the following model using adjusted parameter variables to make the process easier to follow. This is:

$$\ln(y) = a_0 + a_1 \ln(x_1) + a_2 \ln(x_2) + e$$

where $e \sim N(0, \sigma^2)$.
If the results of the log-transformed regression analysis are $a_0 = 6.25$, $a_1 = 0.52$, $a_2 = 0.76$, and $S_{yx} = 0.21$, then the power function regression parameters are calculated as follows:

$$b_0 = \exp\left(a_0 + 0.5 \cdot S_{yx}^2\right) = \exp\left(6.25 + 0.5 \cdot 0.21^2\right) = 529.56$$

with $b_1 = a_1 = 0.52$, $b_2 = a_2 = 0.76$, and $S_{yx} = 0.25$.

Therefore the power function best fit prediction equation is:

$$\hat{y} = 529.56 \cdot x_1^{0.52} \cdot x_2^{0.76}$$
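The whole procedure, log-transform, OLS, and the lognormal adjustment for $b_0$, can be sketched as follows (illustrative names; NumPy assumed):

```python
import numpy as np

def fit_power_model(x1, x2, y):
    """Fit y = b0 * x1^b1 * x2^b2 * eps via OLS on the log-transformed model."""
    n = len(y)
    X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
    a = np.linalg.solve(X.T @ X, X.T @ np.log(y))  # [a0, a1, a2]
    resid = np.log(y) - X @ a
    syx = np.sqrt((resid @ resid) / (n - 3))       # log-space regression std error
    b0 = np.exp(a[0] + 0.5 * syx**2)               # lognormal mean adjustment
    return b0, a[1], a[2]                          # b0, b1 = a1, b2 = a2
```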
POLYNOMIAL REGRESSION MODEL

The polynomial regression model has the form:

$$\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_h x^h$$

In this model, the input variable $x$ can be any value (both positive and negative) but the exponents of $x$ must be positive integer values.
A polynomial model has many applications in trading and finance. For example, a polynomial function of degree $h = 2$ is known as a quadratic model and is used for portfolio optimization. A higher degree polynomial such as $h = 3$ is known as a cubic model and is used in finance to model and optimize complex portfolios; it is also often used as a loss function in optimization problems.

For example, the cubic model:

$$\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^3$$
has reduced matrix form determined from the first-order conditions. This reduced matrix is:

$$\begin{bmatrix} n & \sum x & \sum x^2 & \sum x^3 \\ \sum x & \sum x^2 & \sum x^3 & \sum x^4 \\ \sum x^2 & \sum x^3 & \sum x^4 & \sum x^5 \\ \sum x^3 & \sum x^4 & \sum x^5 & \sum x^6 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum x y \\ \sum x^2 y \\ \sum x^3 y \end{bmatrix}$$
The parameter values $b_0, b_1, \dots, b_h$ can be solved via matrix algebra or via Cramer's rule.
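As with the multiple regression case, the power-sum matrix is simply A'A for the design matrix A = [1, x, x², x³]. A minimal sketch (our own function name):

```python
import numpy as np

def cubic_fit(x, y):
    """Fit y_hat = b0 + b1*x + b2*x^2 + b3*x^3 via the normal equations above."""
    A = np.column_stack([x**p for p in range(4)])  # columns: 1, x, x^2, x^3
    # A.T @ A is the matrix of power sums (n, sum(x), ..., sum(x^6))
    return np.linalg.solve(A.T @ A, A.T @ y)       # [b0, b1, b2, b3]
```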
FRACTIONAL REGRESSION MODEL

In contrast to the polynomial model, the input values of $x$ in fractional regression models must be positive (e.g., $x > 0$) and the exponents can be any value (positive or negative, and integer or fraction).

An example of a fractional regression model with both integer and fractional exponents is:

$$\hat{y} = b_0 + b_1 x + b_2 x^2 + b_3 x^{1/2}$$
This model has reduced matrix form determined from the first-order conditions. This reduced matrix is:

$$\begin{bmatrix} n & \sum x & \sum x^2 & \sum x^{1/2} \\ \sum x & \sum x^2 & \sum x^3 & \sum x^{3/2} \\ \sum x^2 & \sum x^3 & \sum x^4 & \sum x^{5/2} \\ \sum x^{1/2} & \sum x^{3/2} & \sum x^{5/2} & \sum x \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum x y \\ \sum x^2 y \\ \sum x^{1/2} y \end{bmatrix}$$
The parameter values $b_0, b_1, \dots, b_h$ can be solved via matrix algebra or via Cramer's rule.
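The same normal-equations approach handles fractional exponents; the only change is the design matrix, and x must be strictly positive. A hedged sketch:

```python
import numpy as np

def fractional_fit(x, y):
    """Fit y_hat = b0 + b1*x + b2*x^2 + b3*x^(1/2); requires x > 0."""
    A = np.column_stack([np.ones_like(x), x, x**2, np.sqrt(x)])
    return np.linalg.solve(A.T @ A, A.T @ y)       # [b0, b1, b2, b3]
```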