Financial Forecasting
With
Support Vector Regression
Rocko Chen
(rchen@publicist.com)
Acknowledgements
This study was made possible through the guidance and support of the following people:
Dr. Jiling Cao, my supervisor, for his vital facilitation and direction.
All faculty members and staff of the School of Computing & Mathematical Sciences.
Joshua Thompson, friend, professional trader, for the pragmatic perspective and much
needed motivation.
Most especially to the countless Ph.D. researchers whose works I have studied throughout
the past year.
Rocko Chen
Financial Forecasting With Support Vector Regression (2009) 3
Table of Contents
Abstract ............................................................. 5
References .......................................................... 41
Abstract
This study explores Support Vector Machines (SVMs) for the purpose of forecasting the
ASX200 index. SVMs have become popular within the financial forecasting realm for their
unique capability, Structural Risk Minimization (SRM).
The paper commences with a review of relevant numerical optimization concepts. Moving
from basic notions of Lagrangian functions to quadratic programming, it lays the
foundations the reader needs to comprehend the key ideas of SVR.
Details of SVR follow. This section explores the core analytical machinery of SVR and the
SRM principle, explaining the key theory with examples.
The final section is an empirical test in which SVR attempts to estimate the value of the
ASX200 stock index and forecast the next day's return. The test uses roughly seven years of
daily closing prices of several predictive variables to build the model. The best test results
come from an adaptive training approach. The findings, along with the effectiveness of
SRM, are then analysed.
Disclaimer
Rocko Chen retains all rights to the above content in perpetuity. Chen grants others
permission to copy and re-use the document for non-commercial purposes provided that
Chen is given credit. Further, the content remains primarily a subject of research, and the
author accepts no liability for any losses arising from its use. April 28, 2009
Chapter I. Introduction
Developed by Vladimir Vapnik [15], SVMs are distinguished from other popular machine
learning algorithms by several unique characteristics. They offer robust classification and
prediction while performing structural risk minimization.
In many learning tasks, machine methods have proven markedly more efficient than manual
analysis. Even within the machine learning community, SVMs have displayed superiority
over methods such as discriminant analysis (linear or quadratic), logit models, and
back-propagation neural networks [13]. These qualities make SVR desirable for financial
forecasting.
Though unlikely to be random according to growing academic evidence, the financial
markets are high-noise, non-linear, and non-stationary processes. Over the years many
researchers have attempted predictions with various statistical methods, including SVR,
with some promise. However, certain fundamental assumptions limited the applicability of
those studies.
The empirical study examines the assumption-based weaknesses of SVR and attempts to
remedy them via data-selection strategies. The resulting findings point to some interesting
elements. On the whole, SVR could contribute significant value for the professional trader
of financial instruments.
Chapter II. Numerical Optimization Concepts
A local solution x* is a point with a neighbourhood N within Ω such that f(x) ≥ f(x*) for
all feasible x in N.
The active set A(x) at any feasible x consists of the equality constraint indices together
with the inequality constraint indices i for which c_i(x) = 0,
i.e. A(x) = E ∪ {i ∈ I | c_i(x) = 0}.
At a feasible point x, a constraint (i ∈ I) is active if c_i(x) = 0,
and inactive if c_i(x) > 0.
2.2 The Lagrangian Function
L(x, λ_1) = f(x) − λ_1 c_1(x), where λ_1 is the Lagrange multiplier.
Then ∇_x L(x, λ_1) = ∇f(x) − λ_1 ∇c_1(x), and at a solution ∇f(x*) = λ_1 ∇c_1(x*).
2.2.1 Example with Lagrangian Function
min x_1 + x_2, S.T. (such that) 8 − x_1² − x_2² ≥ 0
Feasible region: the interior and border of the circle x_1² + x_2² = 8.
Constraint normal: ∇c_1(x) points toward the interior at boundary points.
Solution: obviously at (−2, −2)^T.
*** A feasible point x is NOT optimal if we can find a small step s that both retains
feasibility and decreases the objective function f to first order.
The step s retains feasibility if
0 ≤ c_1(x + s) ≈ c_1(x) + ∇c_1(x)^T s,
so to first order, feasibility is retained if
c_1(x) + ∇c_1(x)^T s ≥ 0.
Case I
x strictly inside the circle, c_1(x) > 0: any sufficiently small step s retains feasibility, so
x can fail to be improvable only if no first-order descent direction exists, i.e. ∇f(x) = 0
(equivalently, λ_1 = 0).
Case II
x on the boundary of the circle, c_1(x) = 0: then ∇f(x)^T s < 0 and
c_1(x) + ∇c_1(x)^T s ≥ 0 become
∇f(x)^T s < 0 (open half-space) and ∇c_1(x)^T s ≥ 0 (closed half-space).
The intersection of these two regions is empty only when ∇f(x) and ∇c_1(x) point in the
same direction, i.e. ∇f(x) = λ_1 ∇c_1(x), λ_1 ≥ 0.
Notice the sign of λ_1 is significant: if ∇f(x*) = λ_1 ∇c_1(x*) held with a negative λ_1,
then ∇f(x) and ∇c_1(x) would point in opposite directions.
Optimality for cases I and II, summarized WRT (with respect to) L(x, λ_1):
when no first-order feasible descent direction exists at x*,
∇_x L(x*, λ_1*) = 0, for some λ_1* ≥ 0,
together with the required condition λ_1* c_1(x*) = 0.
The complementarity condition implies that λ_1 can be strictly positive only when c_1 is
active.
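The stationarity and complementarity conditions above can be checked numerically for the circle example of 2.2.1; a minimal sketch (function names are ours):

```python
# First-order check for the example of 2.2.1:
# min f(x) = x1 + x2  subject to  c1(x) = 8 - x1^2 - x2^2 >= 0,
# whose solution is x* = (-2, -2) with multiplier lambda1 = 1/4.

def grad_f(x):
    return (1.0, 1.0)                      # gradient of f(x) = x1 + x2

def grad_c1(x):
    return (-2.0 * x[0], -2.0 * x[1])      # gradient of c1(x) = 8 - x1^2 - x2^2

x_star = (-2.0, -2.0)
lam1 = 0.25

# grad_x L(x*, lambda1) = grad f(x*) - lambda1 * grad c1(x*): should vanish
grad_L = tuple(gf - lam1 * gc for gf, gc in zip(grad_f(x_star), grad_c1(x_star)))
c1_value = 8.0 - x_star[0] ** 2 - x_star[1] ** 2   # active constraint: c1 = 0

print(grad_L, lam1 * c1_value)   # (0.0, 0.0) 0.0
```

Both the Lagrangian gradient and the complementarity product vanish, with λ_1 > 0 as case II requires.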
2.2.2 Example with half-disk region
f(x): min x_1 + x_2, S.T. c_1: 8 − x_1² − x_2² ≥ 0, c_2: x_2 ≥ 0.
So the feasible region is the upper half-disk of x_1² + x_2² ≤ 8.
The solution sits clearly at (−√8, 0)^T, where
∇f(x*) = (1, 1)^T, ∇c_1(x*) = (2√8, 0)^T, ∇c_2(x*) = (0, 1)^T.
The conditions ∇c_i(x)^T d ≥ 0, i = 1, 2, are BOTH satisfied only if d lies in the quadrant
spanned by ∇c_1 and ∇c_2, but every d in that quadrant gives ∇f(x)^T d ≥ 0.
Some feasible points are NOT solutions; let us examine the Lagrangian and its gradient at
such points.
At x = (√8, 0)^T, both constraints are active, yet a step such as d = (−1, 0)^T is a feasible
descent direction. The ∇_x L(x, λ) = 0 condition is satisfied only by
λ = (−1/(2√8), 1)^T, and since the first component λ_1 is negative, the optimality
condition [∇_x L(x*, λ*) = 0, for some λ* ≥ 0] is not satisfied, so x is not a solution.
At x = (1, 0)^T, only c_2 is active. Any small step s away from this point continues to
satisfy c_1(x + s) > 0, so we need only consider the behaviour of c_2 and f to see whether
s is a feasible descent step. Here too the optimality conditions
[∇_x L(x*, λ*) = 0, for some λ* ≥ 0] and [λ_1* c_1(x*) = 0, λ_2* c_2(x*) = 0] fail.
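The sign of λ_1 at the two boundary points can be checked numerically; a small sketch (helper name ours), using the closed form λ_1 = −1/(2x_1), λ_2 = 1 obtained from ∇f = λ_1∇c_1 + λ_2∇c_2 at a boundary point (x_1, 0):

```python
import math

# Multiplier signs for the half-disk example:
# f(x) = x1 + x2, c1(x) = 8 - x1^2 - x2^2 >= 0, c2(x) = x2 >= 0.
# At a boundary point (x1, 0) with x1^2 = 8, solving
# grad f = lam1 * grad c1 + lam2 * grad c2 gives lam1 = -1/(2*x1), lam2 = 1.

def lam1_at(x1):
    return -1.0 / (2.0 * x1)

r = math.sqrt(8.0)
lam1_at_solution = lam1_at(-r)      # at (-sqrt(8), 0): positive, KKT can hold
lam1_at_non_solution = lam1_at(r)   # at (+sqrt(8), 0): negative, KKT fails

print(lam1_at_solution > 0, lam1_at_non_solution < 0)   # True True
```

The negative multiplier at (√8, 0)^T is exactly the failure mode described above.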
2.3 Tangent Cones [TΩ(x*)]
Here we have:
Ω: the closed convex constraint set,
x* ∈ Ω,
F(x*): the set of first-order feasible directions at x*.
*The earlier approach of examining the first derivatives of f and c_i, via first-order Taylor
series expansion about x, only works when the linearized approximation reflects the
geometry of the feasible set near the point x in question.
If, near x, the linearization is fundamentally different from the feasible set Ω (e.g. the
entire plane while the feasible set is a single point), then the linear approximation will not
yield useful information.
This is where we must make assumptions about the nature of the c_i:
constraint qualifications ensure the similarity of the constraint set Ω and its linearized
approximation in a neighbourhood of x*.
2.4 Feasible Sequence
[15] Given a feasible point x, we call {Z_k} a feasible sequence approaching x if Z_k ∈ Ω
for all k sufficiently large and Z_k → x.
A local solution x* is one where every feasible sequence approaching x* has the property
that f(Z_k) ≥ f(x*) for all k sufficiently large; we will derive conditions under which this
property holds.
Definition 12.2 [15]
d is a tangent (vector) to Ω at a point x if there are a feasible sequence {Z_k} approaching
x and a sequence of positive scalars {t_k} with t_k → 0 such that
lim_{k→∞} (Z_k − x)/t_k = d.
2.4.1 Linearized Feasible Directions F(x)
Given a feasible point x and the active constraint set A(x), the set of linearized feasible
directions F(x) is
F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
            d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I}.
Note: the definition of the tangent cone does not rely on the algebraic specification of Ω,
only on its geometry. The linearized feasible direction set does depend on the definition of
the constraint functions c_i, i ∈ E ∪ I.
2.4.2 Example with Tangent Cone ≠ Linearized Feasible Sets
Min f(x) = x_1 + x_2, such that c_1: x_1² + x_2² − 2 = 0.
About the constraint: a circle with radius √2; we examine the non-optimal point
x = (−√2, 0)^T, with a feasible sequence approaching it along the circle, e.g.
Z_k = (−√(2 − 1/k²), −1/k)^T.
Note: f increases as we move along Z_k, i.e. f(Z_{k+1}) > f(Z_k) for all k = 2, 3, …
As f(Z_k) < f(x) for k = 2, 3, …, x cannot be a solution.
Via the definition
F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
            d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I},
d = (d_1, d_2)^T ∈ F(x) if
0 = ∇c_1(x)^T d = [2x_1, 2x_2] (d_1, d_2)^T = −2√2 d_1,
so F(x) = {(0, d_2)^T | d_2 ∈ R}, which here coincides with the tangent cone.
Now the algebraic specification of Ω is changed to the equivalent constraint
c_1(x) = (x_1² + x_2² − 2)² = 0. The vector d belongs to the linearized feasible set if
0 = ∇c_1(x)^T d = [4(x_1² + x_2² − 2)x_1, 4(x_1² + x_2² − 2)x_2] (d_1, d_2)^T = 0.
This holds for all (d_1, d_2)^T, so we have F(x) = R², and for this specification of Ω the
tangent cone and linearized feasible sets differ.
2.4.3 Example with F(x) = TΩ(x)
Min x_1 + x_2, S.T. 2 − x_1² − x_2² ≥ 0.
Feasible region: on and within the circle of radius √2.
The solution obviously sits at x = (−1, −1)^T, the same as in the equality-constrained
case, but this time we have many feasible sequences converging to any given feasible point.
E.g. from x = (−√2, 0)^T, the various feasible sequences defined for the
equality-constrained problem are still feasible for this problem. Infinitely many feasible
sequences also converge to x = (−√2, 0)^T along a straight line from the interior of the
circle, of the form
Z_k = (−√2, 0)^T + (1/k)w,
where w is a vector whose first component is positive (w_1 > 0). Z_k remains feasible if
||Z_k|| ≤ √2, which is true when k ≥ (w_1² + w_2²)/(2√2 w_1).
Via the definition
F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
            d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I},
d ∈ F(x) if
0 ≤ ∇c_1(x)^T d = [−2x_1, −2x_2] (d_1, d_2)^T = 2√2 d_1,
i.e. F(x) = {d | d_1 ≥ 0}.
So we have F(x) = TΩ(x) for this specification of the feasible set.
2.5 LICQ (Linear Independence Constraint Qualification)
[15]
LICQ holds at the point x with active set A(x) if the set of active constraint gradients
{∇c_i(x), i ∈ A(x)} is linearly independent.
First-order necessary conditions: if
1. x* is a local solution,
2. the functions f and c_i are continuously differentiable, and
3. LICQ holds at x*,
then there is a Lagrange multiplier vector λ* such that the following KKT conditions are
satisfied at (x*, λ*).
2.6 Karush-Kuhn-Tucker (KKT) conditions
[15]
∇_x L(x*, λ*) = 0
c_i(x*) = 0 for all i ∈ E
c_i(x*) ≥ 0 for all i ∈ I
λ_i* ≥ 0 for all i ∈ I
λ_i* c_i(x*) = 0 for all i ∈ E ∪ I
Some additional properties follow. They may seem fairly obvious, but are worth stating to
clear potential ambiguities as we move on.
Lemma 2.6.4
Let the cone K be defined as K = {By + Cw | y ≥ 0}.
Given any vector g ∈ R^n, we have either that
g ∈ K,
or
there is a d ∈ R^n satisfying g^T d < 0, B^T d ≥ 0, C^T d = 0, but NOT both.
2.6.5 Second Order Conditions
The KKT conditions tell us how the first derivatives of f and the active constraints c_i are
related at the solution x*. When the conditions are satisfied, a move along any vector
w ∈ F(x*) either
1. increases the first-order approximation to the objective function, i.e. w^T ∇f(x*) > 0,
2. or keeps it at the same value, i.e. w^T ∇f(x*) = 0.
The second derivatives of f and the constraints c_i play the "tiebreaking" role: for
directions w ∈ F(x*) with w^T ∇f(x*) = 0, we cannot tell from first-order information
whether a move along w will increase or decrease f. The second-order conditions examine
the second-derivative terms in the Taylor series expansions of f and c_i to resolve this.
Essentially, the second-order conditions concern the curvature of the Lagrangian function
in the "undecided" directions, i.e. those w ∈ F(x*) with w^T ∇f(x*) = 0.
Equivalently, the critical cone contains the directions w that tend to adhere to the active
inequality constraints even when we make small changes to the objective or the equality
constraints.
From the above we can see λ_i* = 0 for all inactive components i ∈ I \ A(x*); it then
follows that the critical cone C(x*, λ*) contains exactly those directions from F(x*) for
which the first derivative does not clearly state whether f will increase or decrease.
2.6.7 Example on Critical Cones
Min x_1, S.T. c_1: x_2 ≥ 0, c_2: 1 − (x_1 − 1)² − x_2² ≥ 0.
So the constraint set is the part with x_2 ≥ 0 of the disk of radius 1 centred at (1, 0).
Clearly, x* = (0, 0)^T,
and its active set is A(x*) = {1, 2}, with active constraint gradients
∇c_1(x*) = (0, 1)^T and ∇c_2(x*) = (2, 0)^T.
The optimal Lagrange multiplier is λ* = (0, 1/2)^T, satisfying
∇f(x*) = λ_1* ∇c_1(x*) + λ_2* ∇c_2(x*).
As LICQ holds, the optimal multiplier is unique.
The critical cone: C(x*, λ*) = {(0, w_2)^T | w_2 ≥ 0}.
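The five KKT conditions of 2.6 can be bundled into a small checker and applied to this example; a sketch under our own naming:

```python
# Verify the KKT conditions at x* = (0, 0), lambda* = (0, 0.5) for
# min x1  s.t.  c1(x) = x2 >= 0,  c2(x) = 1 - (x1-1)^2 - x2^2 >= 0.

def kkt_holds(grad_f, cons, grad_cons, x, lam, tol=1e-9):
    # stationarity: grad f(x) - sum_i lam_i * grad c_i(x) = 0
    resid = list(grad_f(x))
    for lam_i, gc in zip(lam, grad_cons):
        g = gc(x)
        for k in range(len(x)):
            resid[k] -= lam_i * g[k]
    stationary = all(abs(r) <= tol for r in resid)
    feasible = all(c(x) >= -tol for c in cons)           # c_i(x*) >= 0
    dual_feasible = all(l >= -tol for l in lam)          # lambda_i* >= 0
    complementary = all(abs(l * c(x)) <= tol             # lambda_i* c_i(x*) = 0
                        for l, c in zip(lam, cons))
    return stationary and feasible and dual_feasible and complementary

grad_f = lambda x: (1.0, 0.0)
cons = [lambda x: x[1],
        lambda x: 1.0 - (x[0] - 1.0) ** 2 - x[1] ** 2]
grad_cons = [lambda x: (0.0, 1.0),
             lambda x: (-2.0 * (x[0] - 1.0), -2.0 * x[1])]

print(kkt_holds(grad_f, cons, grad_cons, (0.0, 0.0), (0.0, 0.5)))   # True
```

Any other multiplier vector, e.g. (0.5, 0.5), breaks stationarity, consistent with the uniqueness guaranteed by LICQ.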
2.7 Duality
[15]
Given
min f(x), subject to c(x) ≥ 0,
with the Lagrangian function L(x, λ_1) = f(x) − λ_1 c_1(x), the dual objective is
q(λ) = inf_x L(x, λ), and the dual problem is max_{λ ≥ 0} q(λ).
2.7.1 Duality example
min 0.5(x_1² + x_2²) subject to x_1 − 1 ≥ 0
Lagrangian: L(x_1, x_2, λ_1) = 0.5(x_1² + x_2²) − λ_1(x_1 − 1).
With λ_1 fixed, this is a convex function of (x_1, x_2)^T, so the infimum (greatest lower
bound) is achieved when the partial derivatives WRT x_1, x_2 are zero, i.e. x_1 − λ_1 = 0,
x_2 = 0. Substituting back gives the dual problem
max_{λ_1 ≥ 0} −0.5 λ_1² + λ_1.
Apparent solution: λ_1 = 1, corresponding to the primal solution x = (1, 0)^T.
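A quick numerical check (names ours) confirms the dual maximizer and that the primal and dual objectives agree at the optimum:

```python
# Dual of: min 0.5*(x1^2 + x2^2)  s.t.  x1 - 1 >= 0.
# Substituting x1 = lam, x2 = 0 into the Lagrangian gives the dual objective.

def dual(lam):
    return -0.5 * lam ** 2 + lam

def primal(x1, x2):
    return 0.5 * (x1 ** 2 + x2 ** 2)

# coarse search over lam >= 0 locates the maximizer near 1
best_lam = max((k / 1000.0 for k in range(5001)), key=dual)

print(best_lam, dual(best_lam), primal(1.0, 0.0))   # 1.0 0.5 0.5
```

The equal objective values (no duality gap) are what convexity of the primal delivers here.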
Chapter III. Quadratic Programming
[15] The general QP format:
min q(x) = 0.5 x^T G x + x^T c
subject to a_i^T x = b_i, i ∈ E,
           a_i^T x ≥ b_i, i ∈ I,
where
G is a symmetric n-by-n matrix,
E and I are finite sets of indices, and
c, x, and {a_i}, i ∈ E ∪ I, are vectors in R^n.
If G is
positive semidefinite, the problem is a convex QP;
positive definite, it is a strictly convex QP;
indefinite, it is a nonconvex QP, which is more challenging because multiple stationary
points and local minima may exist.
3.1 Equality-constrained QPs
These take the form
min_x q(x) := 0.5 x^T G x + x^T c
subject to Ax = b.
3.2 KarushKuhnTucker (KKT) matrix
With x a feasible point and p = x* − x, the solution satisfies the linear system
[ G   A^T ] [ −p ]   [ g ]
[ A    0  ] [ λ* ] = [ h ],
where g = c + Gx and h = Ax − b.
KKT QP Example
Consider the quadratic objective with gradient components
∂q/∂x_1 = 6x_1 + 2x_2 + x_3 − 8,
∂q/∂x_2 = 2x_1 + 5x_2 + 2x_3 − 3,
∂q/∂x_3 = x_1 + 2x_2 + 4x_3 − 3,
subject to x_1 + x_3 = 3 and x_2 + x_3 = 0. In matrix form,
G = [6 2 1; 2 5 2; 1 2 4], c = (−8, −3, −3)^T,
A = [1 0 1; 0 1 1], b = (3, 0)^T.
We use
[ G   −A^T ] [ x* ]   [ −c ]
[ A    0   ] [ λ* ] = [  b ]
and get
[ 6 2 1 −1  0 ]          [ 8 ]
[ 2 5 2  0 −1 ] [ x* ]   [ 3 ]
[ 1 2 4 −1 −1 ] [ λ* ] = [ 3 ]
[ 1 0 1  0  0 ]          [ 3 ]
[ 0 1 1  0  0 ]          [ 0 ],
whose solution is
x* = (2, −1, 1)^T,
λ* = (3, −2)^T.
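The 5×5 KKT system above can be solved directly; a sketch with NumPy:

```python
import numpy as np

# Solve the KKT system [[G, -A^T], [A, 0]] [x*; lambda*] = [-c; b]
# for the example QP above.
G = np.array([[6.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

K = np.block([[G, -A.T],
              [A, np.zeros((2, 2))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(K, rhs)
x_star, lam_star = sol[:3], sol[3:]

print(x_star, lam_star)   # approximately [2. -1. 1.] and [3. -2.]
```

The recovered x* and λ* match the values stated in the example.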
Many methods exist for solving QPs. From this point, the paper focuses exclusively on the
means most commonly applied in contemporary research.
3.3 Solving QP with A Modified Simplex Method
Jensen and Bard start with an examination of the Karush-Kuhn-Tucker conditions, leading
to a set of linear equalities and complementarity constraints [10]. They then apply a
modified simplex algorithm to find solutions.
General QP
Min f(x) = cx + 1/2x T Qx
Subject to Ax ≤ b and x ≥ 0
c: an n-dimensional row vector of the coefficients of the linear terms in the objective
function.
Q: an (n by n) symmetric matrix of the coefficients of the quadratic terms.
x: an n-dimensional column vector of decision variables (as in linear programming).
A: an (m by n) matrix defining the constraints.
b: the column vector of the constraints' right-hand-side coefficients.
The model drops constants in the equation. We assume a feasible solution exists and that the
constraint region is bounded. A global minimum exists when f(x) is strictly convex for all
feasible points.
KKT Conditions
A positive definite Q guarantees strict convexity. The conditions are
∂L/∂x_j ≥ 0, j = 1,…,n:       c + x^T Q + μA ≥ 0              (12a)
∂L/∂μ_i ≤ 0, i = 1,…,m:       Ax − b ≤ 0                      (12b)
x_j ∂L/∂x_j = 0, j = 1,…,n:   x^T (c^T + Qx + A^T μ^T) = 0    (12c)
μ_i g_i(x) = 0, i = 1,…,m:    μ(Ax − b) = 0                   (12d)
x_j ≥ 0, j = 1,…,n:           x ≥ 0                           (12e)
μ_i ≥ 0, i = 1,…,m:           μ ≥ 0                           (12f)
The KKT conditions now rewritten with constants moved to the right side
Qx + A^T μ^T − y = −c^T        (13a)
Ax + v = b                     (13b)
x ≥ 0, μ ≥ 0, y ≥ 0, v ≥ 0     (13c)
y^T x = 0, μv = 0              (13d)
So the first two expressions have become linear equalities, the third restricts all variables
to be nonnegative, and the fourth applies complementary slackness.
They then apply a simplex algorithm to solve 13a-13d, with a "restricted basis entry rule"
to treat the complementary slackness conditions (13d) implicitly. It takes the following
steps:
• Let the structural constraints be defined by the KKT conditions 13a and 13b.
• Multiply any equation by −1 as needed so that all right-hand-side values are nonnegative.
• Add an artificial variable to each equation.
• Let the objective function be the sum of the artificial variables.
• Put the resulting problem into simplex form.
The aim is to find a solution to the linear program that minimizes the sum of the artificial
variables, with the additional requirement that the complementary slackness conditions are
satisfied at each iteration. If the sum is zero, the solution satisfies 13a-13d. To
accommodate 13d, the rule for selecting entering variables must respect the following
relationship:
the entering variable is the one with the most negative reduced cost, provided that its
complementary variable is not in the basis or would leave the basis on the same iteration.
The algorithm ultimately yields the vector x as the optimal solution and the vector μ as
the optimal dual variables.
Note:
According to the authors, this approach works well only when the objective function is
positive definite, and requires computational effort comparable to a linear programming
problem with m + n constraints (where m is the number of constraints and n the number
of QP variables).
Example: min f(x) = −8x_1 − 16x_2 + x_1² + 4x_2², subject to x_1 + x_2 ≤ 5, x_1 ≤ 3,
x_1, x_2 ≥ 0, i.e.
c^T = (−8, −16)^T, Q = [2 0; 0 8], A = [1 1; 1 0], b = (5, 3)^T.
We can see that Q is positive definite, so the KKT conditions are necessary and sufficient
for a global optimum.
The linear constraints 13a and 13b take the following form:
2x_1 + μ_1 + μ_2 − y_1 = 8
8x_2 + μ_1 − y_2 = 16
x_1 + x_2 + v_1 = 5
x_1 + v_2 = 3
Then they add the artificial variables to each constraint and minimize their sum:
Minimize a_1 + a_2 + a_3 + a_4
subject to
2x_1 + μ_1 + μ_2 − y_1 + a_1 = 8
8x_2 + μ_1 − y_2 + a_2 = 16
x_1 + x_2 + v_1 + a_3 = 5
x_1 + v_2 + a_4 = 3
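The optimal point is not stated in the text; a brute-force grid check of the underlying QP (objective reconstructed from c and Q above, so treat it as our assumption) suggests x* = (3, 2):

```python
# Brute-force check of the example QP:
# min f(x) = -8*x1 - 16*x2 + x1^2 + 4*x2^2
# s.t. x1 + x2 <= 5, x1 <= 3, x1 >= 0, x2 >= 0.

def f(x1, x2):
    return -8.0 * x1 - 16.0 * x2 + x1 ** 2 + 4.0 * x2 ** 2

def feasible(x1, x2):
    return 0.0 <= x1 <= 3.0 and x2 >= 0.0 and x1 + x2 <= 5.0

# evaluate f on a fine grid over the feasible region
points = ((i / 100.0, j / 100.0) for i in range(301) for j in range(501))
best = min((p for p in points if feasible(*p)), key=lambda p: f(*p))

print(best, f(*best))   # (3.0, 2.0) -31.0
```

At (3, 2) both x_1 ≤ 3 and x_1 + x_2 ≤ 5 are active, and the multipliers (μ_1, μ_2) = (0, 2) satisfy the KKT conditions.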
Chapter IV. SVM Mechanics
SVMs map data nonlinearly into a high (potentially infinite) dimensional feature space via
a kernel function, where the data become easily separable by a linear hyperplane [14]. The
distance (margin) between the nearest points (the support vectors) and the hyperplane is
then maximized. The result is a nonlinear regression in the original low-dimensional space,
where extrapolations of improved accuracy, i.e. classifications or forecasts, become
possible.
4.1 General overview
We commence with a data set G = {(X_i, y_i)}_{i=1}^N of N data points, where X_i
denotes the input (predictive variables) and y_i the corresponding output (response
variable). ε-SVR then works out a function f(x) that deviates at most ε from the actually
obtained y_i for all the training data while keeping f(x) as flat as possible. We therefore
end up with an "expected" range for extrapolation errors, making the findings more
practical. [9]
The estimate takes the form ĝ(X) = <ω, Φ(X)>, where
ω: the coefficient vector,
b: the threshold value.
The dot product takes place between ω and Φ(X); ĝ(X_i) + b defines the hyperplane in
the feature space F given by the mapping Φ(X), whose dimensionality can get very high,
possibly infinite.
A small ω leads to improved flatness. One way to find a small ω is to minimize the norm,
i.e. ||ω||² = <ω, ω>.
The SVR QP
The SVR becomes a minimization problem with slack variables ξ and ξ*:
Minimize ½||ω||² + C Σ_{i=1}^N (ξ_i + ξ*_i)   [16]
subject to ĝ(X_i) + b − y_i ≤ ε + ξ_i,
           y_i − [ĝ(X_i) + b] ≤ ε + ξ*_i,
           ξ_i, ξ*_i ≥ 0.
This is solved by forming the Lagrangian and transforming into a dual problem, giving
f(t+1) = ĝ(X) = Σ_{i=1}^N (α*_i − α_i) K(X, X_i) + b,
where α*_i, α_i are the Lagrange multipliers associated with X_i, and
0 ≤ α*_i, α_i ≤ C,
Σ_{i=1}^N (α*_i − α_i) = 0.
Training points with nonzero Lagrange multipliers are called Support Vectors (SV), where
the smaller the fraction of SVs, the more general the solution.
The coefficients α*_i, α_i are obtained by maximizing the following form subject to the
conditions stated above:
R(α*_i, α_i) = Σ_{i=1}^N y_i (α*_i − α_i) − ε Σ_{i=1}^N (α*_i + α_i)
               − ½ Σ_{i,j=1}^N (α*_i − α_i)(α*_j − α_j) K(X_i, X_j).
The training vectors x i are mapped into a higher (potentially infinite) dimensional space by
the kernel function Φ. Then the SVM finds a linear hyperplane separating the (support)
vectors in this space with maximal margin (distance to hyperplane). [12]
4.2 Structural Risk Minimization (SRM)
Popular financial time series models tend to "overfit", i.e. they focus so exclusively on the
quality of fit to the training data (empirical error) that the structural risk (potential
forecast error) becomes perilously volatile.
SVMs are unique as a learning algorithm in applying SRM: they exercise capacity control
over the decision function, the kernel function, and the sparsity of the solution (Huang,
Nakamori, Wang 2005). The SRM principle lets SVMs estimate functions while minimizing
generalization error, making SVM classification and SVR highly resistant to overfitting.
Let's say the SVR aims to estimate/learn a function f(x, λ), where X is the input space
(e.g. stock index prices together with econometric indicators) and λ ∈ Λ is a set of
abstract parameters, from an independent, identically distributed (i.i.d.) sample of size N,
(x_1, y_1), …, (x_N, y_N), x_i ∈ X ⊆ R^d, y_i ∈ R.   (1.1)
The training data (x_i, y_i) come from an unknown distribution P(x_i, y_i).
We then look for the function f(x, λ*) with the smallest possible value of the expected
risk (or extrapolation error),
R[λ] = ∫ l[y, f(x, λ)] dP(x, y).   (1.2)
With the probability distribution P(x, y) in (1.2) unknown, computing and minimizing
R[λ] directly is impossible. However, we do have some information about P(x, y) from the
i.i.d. sample (1.1), so it becomes possible to compute a stochastic approximation of R[λ],
the empirical risk:
R_emp[λ] = (1/N) Σ_{i=1}^N l[y_i, f(x_i, λ)]   (1.3)
According to Yang, by the law of large numbers the empirical risk converges to its
statistical expectation. Despite this, for small sample sets, minimizing the empirical risk
alone can lead to problems such as loss of accuracy or overfitting.
A bound of the form
R[λ] ≤ R_emp[λ] + Φ(h/N),   (1.4)
where h is the VC-dimension and Φ a confidence term that grows with h/N, makes it
apparent that to achieve a small expected risk, i.e. improved accuracy, both the empirical
risk and the ratio between the VC-dimension and the number of data points must stay
relatively small. Since the empirical risk usually behaves as a decreasing function of h, an
optimal value of the VC-dimension exists for a given number of samples. Therefore, when
given a relatively limited number of data points, a well-chosen value of h (often controlled
by free model parameters) is the key to superb performance.
The above led to the technique of SRM (Vapnik and Chervonenkis 1974), which attempts
to choose the most appropriate VC-dimension.
Overall, SRM is an inductive principle for optimizing the trade-off between
hypothesis-space complexity and empirical error. The resulting capacity control offers
minimum test error while keeping the model as "simple" as possible.
• Within a given domain, a function class is selected (e.g. nth-degree polynomials, n-layer
neural networks, n-rule fuzzy logic).
• The function class is divided into nested subsets, ordered by complexity.
• Empirical risk minimization is performed within each subset, i.e. general parameter
selection.
SRM therefore naturally holds the potential to dramatically improve the stability of
forecast errors. As financial markets are exceptionally noise-saturated processes,
consistency in prediction errors is the next best thing for practical utilization.
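The overfitting risk that SRM controls can be seen on hypothetical data: as the hypothesis subset grows (here, polynomial degree), the empirical risk only falls, which by itself says nothing about generalization. A sketch with NumPy (the data and degrees are our own illustration):

```python
import numpy as np

# Empirical risk of polynomial fits of increasing degree on noisy data:
# richer nested subsets always fit the training sample at least as well.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(x.size)

def empirical_risk(degree):
    coeffs = np.polyfit(x, y, degree)         # least-squares fit in the subset
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))     # squared-loss R_emp

risks = [empirical_risk(d) for d in (1, 3, 5, 9)]
print(risks)   # a non-increasing sequence
```

SRM's point is precisely that the subset with the lowest empirical risk is not the one to pick; the capacity term must be weighed against it.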
4.3 The Loss Function
The loss function measures the empirical risk. Many types exist; some are listed below with
their corresponding noise density models [17]:

Loss function         l(δ)                                      density ρ(δ)
Linear ε-insensitive  |δ|_ε                                     1/(2(1 + ε)) exp(−|δ|_ε)
Gaussian              ½δ²                                       1/√(2π) exp(−δ²/2)
Huber's robust        δ²/(2σ) if |δ| ≤ σ,                       ∝ exp(−δ²/(2σ)) if |δ| ≤ σ,
                      |δ| − σ/2 otherwise                       exp(σ/2 − |δ|) otherwise
Polynomial            (1/d)|δ|^d                                d/(2Γ(1/d)) exp(−|δ|^d)

Suppose the target values y are generated by an underlying functional dependency f plus
additive noise δ with density ρ(δ). Then minimizing R_emp coincides with maximizing the
likelihood of the sample, i.e. with choosing the loss l(y, f(x)) = −log ρ(y − f(x)).
The squared loss, however, is not always the best choice. The ε-insensitive loss function is
also popular, and is the one applied for the ε-SVR used later in this study.
l_ε(y, f(x)) = 0 if |y − f(x)| < ε; |y − f(x)| − ε otherwise.   (1.8)
The ε-insensitive function reports no error as long as data points remain within the range
±ε. Increasing ε therefore tends to reduce the number of support vectors; at the extreme
it may result in a constant regression function. The loss function thus indirectly affects the
complexity and generalization of the model.
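Equation (1.8) translates directly into code; a minimal sketch (function name ours):

```python
# The epsilon-insensitive loss of (1.8): zero inside the tube of width
# epsilon, linear in the excess deviation outside it.

def eps_insensitive_loss(y, f_x, eps):
    err = abs(y - f_x)
    return 0.0 if err < eps else err - eps

inside = eps_insensitive_loss(1.0, 1.05, 0.1)    # |error| < eps -> no loss
outside = eps_insensitive_loss(1.0, 1.3, 0.1)    # |error| - eps otherwise

print(inside, outside)
```

Points inside the ε-tube contribute nothing to the empirical risk, which is what allows the solution to be sparse in support vectors.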
4.4 ε-SVR
With the advantages of the ε-insensitive function, ε-SVR has become quite helpful for
regression-type data sets. It aims to find a function f with parameters w and b by
minimizing the following regression risk:
R_reg(f) = ½<w, w> + C Σ_{i=1}^N l(f(x_i), y_i)   (1.5)
4.5 Standard Method for SVR QP Solutions
The standard optimal solution for (1.9) obtains the function f as follows [17]:
min ½ Σ_{i,j=1}^N (α_i − α*_i)(α_j − α*_j) <Φ(x_i), Φ(x_j)>
    + Σ_{i=1}^N (ε − y_i) α_i + Σ_{i=1}^N (ε + y_i) α*_i   (1.11)
subject to Σ_{i=1}^N (α_i − α*_i) = 0, α_i^(*) ∈ [0, C],   (1.12)
which yields
f(x) = Σ_{i=1}^N (α_i − α*_i) <Φ(x_i), Φ(x)> + b.
With the QP solved, we still need to find the value of b. The KKT conditions help in this
case:
α_i (ε + ξ_i − y_i + <w, Φ(x_i)> + b) = 0,
α*_i (ε + ξ*_i + y_i − <w, Φ(x_i)> − b) = 0,
and
(C − α_i) ξ_i = 0,
(C − α*_i) ξ*_i = 0.
4.6 The Decomposition Method
[8]
Real-life QPs, however, do not come in nice forms and require heavy computing power to
solve. The decomposition method has become the general approach to solving SVM QPs
via an iterative process.
The method modifies only a subset of α per iteration. This subset, the working set B,
leads to a small sub-problem to be minimized in each iteration; with two elements in B,
each iteration solves a simple two-variable problem:
1. Start from an initial feasible α^1 and set k = 1.
2. Find a two-element working set B = {i, j} by Working Set Selection (WSS); define
N = {1,…,l} \ B, and let α_B and α_N be the sub-vectors of α^k corresponding to B and
N respectively.
3. If a_ij = K_ii + K_jj − 2K_ij > 0, solve the two-variable sub-problem in
α_B = (α_i, α_j)^T:
min ½ α_B^T Q_BB α_B + (p_B + Q_BN α_N)^T α_B   (3.3)
subject to 0 ≤ α_i, α_j ≤ C,
y_i α_i + y_j α_j = Δ − y_N^T α_N.
Otherwise, solve the convexified sub-problem obtained by adding to (3.3) the term
((τ − a_ij)/4) [(α_i − α_i^k)² + (α_j − α_j^k)²].   (3.4)
4. Set α_B^{k+1} to the optimal solution of the sub-problem and α_N^{k+1} = α_N^k.
Set k + 1 → k and repeat from step 2.
So B is updated with each iteration. If a_ij ≤ 0, (3.3) is a concave problem, and we use
the convex modification (3.4).
Stopping criteria and WSS for C-SVC, ε-SVR, and One-class SVM
The KKT optimality condition says that a vector α is a stationary point of the general
form if and only if there are a number b and two nonnegative vectors λ and μ such that
∇f(α) + by = λ − μ,
λ_i α_i = 0, μ_i (C − α_i) = 0, λ_i ≥ 0, μ_i ≥ 0, i = 1,…,l,
where ∇f(α) = Qα + p is the gradient of f(α). This can be rewritten as
∇_i f(α) + b y_i ≥ 0 if α_i < C,
∇_i f(α) + b y_i ≤ 0 if α_i > 0.
WSS 1
1. For all t, s, define
a_ts = K_tt + K_ss − 2K_ts, b_ts = −y_t ∇_t f(α) + y_s ∇_s f(α) > 0,
and select
i ∈ arg max_t {−y_t ∇_t f(α) | t ∈ I_up(α)},
j ∈ arg min_t {−b_it²/a_it | t ∈ I_low(α), −y_t ∇_t f(α) < −y_i ∇_i f(α)}.
2. Return B = {i, j}.
4.7 The Kernel Function
[3]
The name "kernel" comes from integral operator theory, which underpins much of the
theory relating kernels to their associated feature spaces.
SVR applies the mapping function Φ to handle non-linear samples such as financial data
sets. It maps the input space X into a new space Ω = {Φ(x) | x ∈ X}, i.e. the mapping
function turns x = (x_1,…,x_N) into Ω = Φ(x) = [Φ_1(x),…, Φ_N(x)]. In the feature
space Ω we can then obtain a linear regression function.
Recalling the dual problem,
min ½ Σ_{i,j=1}^N (α_i − α*_i)(α_j − α*_j) <Φ(x_i), Φ(x_j)>
    + Σ_{i=1}^N (ε − y_i) α_i + Σ_{i=1}^N (ε + y_i) α*_i
subject to Σ_{i=1}^N (α_i − α*_i) = 0, α_i^(*) ∈ [0, C],
we notice that the objective function contains an inner product of the mapping function
Φ(x). The inner product lets us specify a kernel function WITHOUT considering the
mapping function or the feature space explicitly. With this advantage, we define the
kernel function
K(x_i, x_j) = <Φ(x_i), Φ(x_j)>.
Since the feature vectors are never expressed explicitly, the number of inner-product
computations need not be proportional to the number of features. The kernel makes it
possible to map data implicitly into a feature space for training while evading potential
problems in evaluating the feature map. The Gram (kernel) matrix is the only information
about the training set that the optimization uses.
The RBF kernel usually makes a reasonable first choice. It maps samples non-linearly into
a higher-dimensional space, so it is likely to handle financial (non-linear) data better than
the linear kernel. At the same time, it is not as complex as the polynomial kernel and
avoids the validity issues of the sigmoid type.
Note: the linear and sigmoid kernels have been shown to behave like the RBF kernel for
certain parameters (Keerthi and Lin 2003, Lin and Lin 2003).
Also, the polynomial kernel may become too complex, as it has more hyper-parameters
than the RBF kernel. The RBF kernel also has fewer numerical difficulties: 0 < K_ij ≤ 1
for RBF, whereas polynomial kernel values may go to infinity (γ x_i^T x_j + r > 1) or
zero (γ x_i^T x_j + r < 1) as the degree increases.
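The boundedness claim is easy to verify: K(x, z) = exp(−γ||x − z||²) always lies in (0, 1]. A sketch (names ours):

```python
import math

# RBF kernel K(x, z) = exp(-gamma * ||x - z||^2); its values lie in (0, 1].

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

vals = [rbf_kernel((1.0, 2.0), (1.0, 2.0)),           # identical points -> 1.0
        rbf_kernel((0.0, 0.0), (3.0, 4.0)),           # distant points -> near 0
        rbf_kernel((0.0, 0.0), (0.5, 0.5), gamma=0.1)]

print(vals)
```

Because the exponent is never positive, the kernel values cannot blow up the way high-degree polynomial kernel values can.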
4.8 Cross-Validation and Grid-search
The goal is to find the values of (C, γ) that give optimal predictions. [4]
In v-fold cross-validation, we first divide the training set into v subsets of equal size. Each
subset is then tested using the model trained on the other (v − 1) subsets. Each instance
of the whole training set is thus predicted once, and the cross-validation accuracy is the
percentage of data correctly classified.
Grid-search: pairs of (C, γ) are tried and the pair with the best cross-validation accuracy
is picked.
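The v-fold split underlying the procedure can be sketched in a few lines (helper name ours); the key property is that every training instance lands in exactly one validation fold, so each is predicted exactly once:

```python
# Partition indices 0..n-1 into v folds of (nearly) equal size for
# v-fold cross-validation.

def v_fold_indices(n, v):
    return [list(range(k, n, v)) for k in range(v)]

folds = v_fold_indices(17, 5)
covered = sorted(i for fold in folds for i in fold)

print(len(folds), covered == list(range(17)))   # 5 True
```

For grid-search, this split would be repeated for each candidate (C, γ) pair, keeping the pair with the highest average validation accuracy.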
Chapter V. Empirical Analysis and Findings
5.1 About Financial Time Series
The financial markets do not function arbitrarily at random. Just as natural resources are
limited, so does liquidity of financial instruments over the exchanges [2]. Supply and
demand drive price moves, i.e. basic economics. To study prices of the immediate future
then, one must investigate forces affecting market supply or demand.
Many researchers apply the historical response variable (the predicted output) as the
exclusive input as well, implying that financial time series are dependent and stationary in
nature. This is fundamentally flawed, as the institutions (hedge funds, investment banks,
George Soros, etc.) that execute price-moving trades largely base their decisions exclusively
on economic conditions.
Admittedly, financial time series contain quite a bit of noise, making precise value forecasts
challenging. Short-term intra-day price change predictions require deep analysis of market
depth (bid/ask volume changes at the market-maker and specialist levels, etc.) [1].
Notwithstanding the foregoing, empirical evidence suggests noteworthy relations between
financial instruments and economic factors over longer time frames.
5.2 Analysis Setup
This study aims to forecast ASX200 stock index value (range) and directional bias. The
predictions result from interpretations of several independent, significantly correlated
financial time series.
Historically, equity (and commodity, real estate) markets often respond to inter-market
interactions involving factors such as interest rates, political events, and traders'
expectations. Inflationary, liquidity, and sentiment measures independent of the ASX200
(the response variable) therefore make viable predictive inputs. This study thus commences
on a fundamentally sound structure.
5.2.1 General idea
The training data set (X_i, Y_{i+1}) holds daily closing prices and quantities derived thereof.
Y_{i+1}: the set of output/response variables applied for training at day (i + 1)
X_i: the set of input variables matched at day (i)
i = 1, ..., N
N = 1,707
The initial training set (X_i, Y_{i+1}) starts from 31/10/01 and completes at the week ending
Mar. 08.
Training allows the SVR to create a linear model in a high-dimensional space matching each
input set with the output value of the next trading day.
Once training completes, the SVR-derived regression model can be applied to testing data
X_t to make next-day forecasts of Y_{t+1}.
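The day-i-to-day-(i+1) pairing can be sketched as below; the array names, shapes, and synthetic series are illustrative assumptions, not the study's actual data.

```python
import numpy as np

# Synthetic stand-ins: one row of input features per day (e.g. AUD, VIX,
# GOX derived quantities) and one response value per day (ASX200 close).
rng = np.random.default_rng(1)
inputs = rng.normal(size=(1707 + 1, 3))                   # day-i features
asx200 = np.cumsum(rng.normal(size=1707 + 1)) + 4000.0    # response series

# Pair day-i inputs with the day-(i+1) output: (X_i, Y_{i+1}), i = 1..N.
X = inputs[:-1]   # X_i
Y = asx200[1:]    # Y_{i+1}, shifted one day forward relative to X_i
```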
5.2.2 Training Output Selection (Y i +1 )
Yahoo Finance serves as the (free) data provider. The following are the 1-day forward
response variables,
Rocko Chen
Financial Forecasting With Support Vector Regression (2009) 32
y_{i+1}: (Ticker Symbol: ^axjo) the ASX200 closing price for day (i + 1).
y_r: the ASX200 1-day return for day (i + 1), given by y_r = y_{i+1}/y_i - 1, so it appears in
the form of a percentage.
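The return formula amounts to the following one-liner; the closing prices used here are made up for illustration.

```python
import numpy as np

# 1-day forward return: y_r = y_{i+1} / y_i - 1, for hypothetical closes.
y = np.array([4000.0, 4040.0, 3999.6])   # invented ASX200 closing prices
y_r = y[1:] / y[:-1] - 1                 # e.g. 4040/4000 - 1 = 0.01 (i.e. +1%)
```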
5.2.3 Training Input Selection (X i )
While numerous economic factors contribute to the effective valuation of global equity
indexes (including the ASX200), this paper focuses on the mathematical potential of SVR;
the predictive variables are therefore kept relatively minimal. The few selected, however,
hold significant correlations to global equity markets.
The predictive time series are acquired from ratesfx.com and Yahoo Finance and listed
below,
1. AUD: the value of 1 AUD (Australian Dollar) in USD (US Dollar). The exchange rate
plays an important role, as arbitrageurs, particularly institutional program-trading
machines, exploit price discrepancies between international exchanges. Program
trading deals with large volumes, often enough to have a significant price-moving
impact. [7]
2. VIX: (Ticker Symbol: ^vix) the CBOE (Chicago Board Options Exchange) Volatility
Index. This index reflects the average implied volatility of near-the-money S&P500
index options. As the S&P500 remains the most prevalently referenced American
stock index, the VIX reflects "expected" market volatility of the immediate future and
thereby the sentiment of American traders.
3. GOX: (Ticker Symbol: ^gox) the CBOE Gold Index. Throughout the ages, gold has
served well as an instrument of fixed intrinsic value, i.e. with insignificant
depreciation and no real added value. Gold price therefore presents a near-perfect
indication of the real rate of inflation.
The derived changes in VIX and GOX should reflect investor sentiment and inflationary
concerns respectively. As the rate of change (derivative) of volatility and of inflation
appears historically stationary, these measures should contribute as significant predictive
variables.
5.2.4 Testing Variables
X_t: all of the training input variables on day (t).
Z_{t+1}: the forecast set, comprising the ASX200 index (z_{t+1}) and 1-day return (z_r), at (t + 1).
Test data for (z_{t+1}) ranges from 3/9/08 to the week ending 9/4/08.
Test data for (z_r) ranges from 10/12/08 to the week ending 9/4/08.
5.2.5 Error Analysis
ε_c: the (z_{t+1}) test error, with formula ε_c = y_{t+1}/z_{t+1} - 1.
ε_r: the (z_r) test error, with formula ε_r = y_r(t)/z_r(t) - 1.
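Both relative-error formulas are straightforward to compute; the actual and forecast values below are hypothetical.

```python
import numpy as np

# eps_c = y_{t+1}/z_{t+1} - 1  (price forecast error)
# eps_r = y_r(t)/z_r(t) - 1    (return forecast error)
y_next, z_next = 4040.0, 4000.0   # invented actual vs. forecast ASX200 close
y_ret, z_ret = 0.012, 0.010       # invented actual vs. forecast 1-day return

eps_c = y_next / z_next - 1
eps_r = y_ret / z_ret - 1
```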
5.2.6 SVR Adaptive Training Strategy
Since financial time series do not exhibit a stationary nature, distant extrapolations usually
present inconsistent, unreliable performance. To remedy this, for every y_{t+1}, all training
data adjusts to complete at day (t), thereby adapting with each sequential test.
For example, the training set (X_1, ..., X_100, Y_2, ..., Y_101) leads to test variables
(X_101, Z_102); the next adapted training set (X_2, ..., X_101, Y_3, ..., Y_102) then leads to
test variables (X_102, Z_103).
This slight tweak of the machine learning process resulted in significantly more contained
Z_{t+1} test errors. It lets the SVR continuously adjust and learn from the daily changes in
x_i. The extrapolation then always remains one time step (i.e. one business day) forward,
thereby mitigating nonstationarity-related testing error.
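The adaptive scheme is essentially a rolling one-step-ahead retraining loop, sketched below. Ordinary least squares stands in for the SVR fit here, and the data, window size, and `fit` helper are illustrative assumptions.

```python
import numpy as np

def fit(X, y):
    # Linear least-squares stand-in for the SVR training step.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# X[i] holds day-i inputs; Y[i] holds the matched next-day output (as in 5.2.1).
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 3))
Y = X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.normal(size=120)

# For each test day t, retrain on the most recent `window` pairs ending at
# day t, then forecast exactly one step ahead.
window = 100
forecasts = []
for t in range(window, len(X)):
    w = fit(X[t - window:t], Y[t - window:t])  # training completes at day t
    forecasts.append(X[t] @ w)                 # one-step-ahead forecast Z_{t+1}
forecasts = np.array(forecasts)
```

The model is discarded and refit for every test day, so it always sees the most recent regime.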
5.2.7 Applied Software
Many current SVM researchers have created software toolsets for Matlab. LS-SVM (Least
Squares SVM) turned out the most user-friendly and is therefore applied in this study. The
adaptive training modification, however, requires a bit more legwork, as the toolset does not
carry this capability by default.
For test-error and associated analysis, Minitab does the job adequately. Excel handles the
raw data.
5.3 Empirical Findings
Below follows some statistical analysis of the SVR forecast results and errors.
With ε_c keeping such a similar pace with the actual ASX200, the index price moves could
potentially be exploited if ε_c displays a likelihood of remaining stationary.
5.3.1 Cross Correlation Analysis
Time Series 1: y_t
Time Series 2: z_t
With the correlation significantly positive as the lag escalates, particularly at roughly 29
time steps, the ASX200 index seems to "follow" the SVR-derived extrapolation value. A
negative correlation with respect to negative lag suggests an equivalent likelihood. An
analysis of the error terms could support this idea.
Time Series 1: y_r
Time Series 2: z_r
It appears that an edge, while not large, still exists for the 1-day return forecast. The
apparently spurious correlation values as the absolute lag increases suggest that z_r should
be applied for the associated day exclusively.
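A lagged cross-correlation of the kind used above can be computed as in the following sketch; the two series are synthetic, constructed so that one follows the other with a known delay.

```python
import numpy as np

def cross_corr(a, b, lag):
    # Pearson correlation between a(t) and b(t - lag): a positive peak at
    # some lag > 0 indicates that series `a` follows series `b`.
    if lag > 0:
        a, b = a[lag:], b[:-lag]
    elif lag < 0:
        a, b = a[:lag], b[-lag:]
    return np.corrcoef(a, b)[0, 1]

# Synthetic example: y copies z with a 5-step delay plus small noise.
rng = np.random.default_rng(3)
z = rng.normal(size=500)
y = np.roll(z, 5) + 0.1 * rng.normal(size=500)

ccf = {lag: cross_corr(y, z, lag) for lag in range(-10, 11)}
```

Scanning `ccf` recovers the delay: the correlation peaks at lag 5, mirroring the "ASX200 follows the SVR extrapolation" reading in the text.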
5.3.2 Normality Tests
These tests offer a feel for the error distributions, as the subsequent analyses depend on
their conclusions.
The ASX200 closing price valuation error (ε_c) distribution does not appear normal; a
nonparametric analysis is therefore required.
Interestingly, the 1-day return forecast error does (likely) resemble a normal distribution,
so a parametric approach viably follows below.
5.3.3 Error Distributions
Let us look at the distributions and perhaps get a feel for their behaviour.
(ε_c)
The distribution appears largely negative, reflecting the heavy panic-motivated selling since
late 2007. Interestingly, (ε_c) did not present any considerable outliers despite jumps in
market volatility throughout the test period.
(ε_r)
In spite of the relatively small sample, the negative skew and pronounced kurtosis concur
with actual equity market behaviour. Negative skew reveals the tendency of stock prices to
drop harder than they grow, partly as a result of the inherent credit risk associated with
each exchange-listed entity. Positive kurtosis coincides with the way financial instrument
values tend to move sporadically, as volatility and traders' sentiment do not stand still (as
evidenced empirically).
5.3.4 The Gold Connection
On a coincidental note, the AXJO shows a considerably high positive correlation with the
GOX (see the graph below; y = correlation, x = i), dating from Oct. 01 to Feb. 09.
[Figure: rolling correlation between the AXJO and the GOX; y-axis: correlation from -1.00 to 1.50, x-axis: i from 0 to 2000.]
The correlation suggests that the ASX200 index largely adjusts with the real rate of
inflation; no significant real growth in value has therefore occurred for it over roughly the
past decade. It also implies that gold remains a significant factor in the pricing of equity
indexes.
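A rolling correlation of the sort plotted above can be sketched as follows; the two price series are synthetic stand-ins sharing a common driver, and the window length is an arbitrary choice.

```python
import numpy as np

def rolling_corr(a, b, window):
    # Pearson correlation of a and b over each trailing `window`-day span.
    return np.array([np.corrcoef(a[i - window:i], b[i - window:i])[0, 1]
                     for i in range(window, len(a) + 1)])

# Synthetic stand-ins for the AXJO and GOX daily closes, driven by a
# shared random-walk factor so their windows correlate positively.
rng = np.random.default_rng(4)
common = np.cumsum(rng.normal(size=300))
axjo = 4000 + 50 * common + rng.normal(size=300)
gox = 150 + 2 * common + 0.5 * rng.normal(size=300)

corr = rolling_corr(axjo, gox, window=60)
```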
5.4 Test Conclusions
SRM (Structural Risk Minimization) has truly shone throughout the experiment. Despite
recent shocks in the global financial economy, the SVR-derived quantities maintained a
fairly stable bound of errors. This implies practicality for professional traders.
Having an accurate forecast of price ranges definitely helps; it makes it possible to literally
"buy low, sell high". Though not quite the Holy Grail, SVR offers promising means to exploit
economic inefficiencies, perhaps opening doors to a new frontier.
Reference
[1] Almgren, R., Thum, C., Hauptmann, E., Li, H. (2006). Equity market impact (Quantitative
trading, 2006/14). University of Toronto, Dept. of Mathematics and Computer
Science; Citigroup Global Quantitative Research, New York.
[2] Boucher, M. (1999). The Hedge Fund Edge. New York: John Wiley & Sons, Inc.
[3] Cao, L., Tay, F. (2001) Financial Forecasting Using Support Vector Machines (Neural
Comput & Applic (2001)10:184-192). Dept. of Mechanical and Production Engineering,
National University of Singapore, Singapore.
[4] Chang, C., Lin, C. (2009). LIBSVM: a Library for Support Vector Machines. Dept. of
Computer Science, National Taiwan University, Taipei.
[5] Chen, P., Fan, R., Lin, C. & Joachims, T. (2005) . Working Set Selection Using Second
Order Information for Training Support Vector Machines (Journal of Machine Learning
Research 6 (2005) 1889-1918). Department of Computer Science, National Taiwan
University, Taipei.
[6] Claessen, H., Mittnik, S. (2002). Forecasting Stock Market Volatility and the Informational
Efficiency of The DAX-index Options Market. Johann Wolfgang Goethe-University,
Frankfurt.
[7] Dubil, R. (2004). An Arbitrage Guide To The Financial Markets. West Sussex, England:
John Wiley & Sons, Ltd.
[8] Glasmachers, T., Igel, C. (2006). Maximum-Gain Working Set Selection for SVMs (Journal
of Machine Learning Research 7 (2006)1437-1466). Ruhr-University, Bochum Germany.
[9] Huang, W., Nakamori, Y. & Wang, S. (2004). Forecasting stock market direction with
support vector machine (Computers & Operations Research 32 (2005) 2513-2522) School of
Knowledge Science, Japan Advanced Institute of Science and Technology; Institute of
Systems Science, Academy of Mathematics and System Sciences, Chinese Academy of
Sciences, Beijing.
[10] Jensen, P., Bard, J. (2008). Operations Research Models and Methods. New York: John
Wiley & Sons, Inc.
[11] Joachims, T. (1998). Making Large-Scale SVM Learning Practical. Cambridge, USA: MIT
Press.
[12] Kecman, V. (2001). Learning from data, Support vector machines and neural networks.
Cambridge, USA: MIT Press.
[13] Kumar, M., Thenmozhi, M. (2005). Forecasting Stock Index Movement: A Comparison of
Support Vector Machines And Random Forest. Dept. of Management Studies, Indian
Institute of Technology, Chennai.
[14] Li, B., Hu, J. & Hirasawa, K. (2008). Financial Time Series Prediction Using a Support
Vector Regression Network. Graduate School of Information, Waseda University, Japan.
[15] Nocedal, J., Wright, S. (1999). Numerical Optimization. New York: Springer-Verlag.
[16] Smola, A., Scholkopf, B. (2003). A Tutorial on Support Vector Regression (NeuroCOLT
Technical Report TR-98-030). RSISE, Australian National University, Canberra, Australia &
Max-Planck-Institut fur biologische Kybernetik, Tubingen, Germany.
[17] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market
Prediction. Dept. of Computer Science & Engineering, The Chinese University of Hong
Kong, Hong Kong.