
Financial Forecasting
With
Support Vector Regression

Rocko Chen

(rchen@publicist.com)

School of Computing & Mathematical Sciences,


Auckland University of Technology
Supervised by Dr Jiling Cao

30th of April, 2009



Acknowledgement

This study was made possible through the guidance and support of the following people:

Dr Jiling Cao, my supervisor, for his vital facilitation and direction.

Neil Binnie, the overseer, for his understanding and assistance.

Peter Watson, our Programme Leader, for his unremitting encouragement.

All School of Computing & Mathematical Sciences faculty members and staff.

Joshua Thompson, friend and professional trader, for the pragmatic perspective and much
needed motivation.

Most especially, the countless Ph.D. researchers whose works I have studied throughout
the past year.

Table of Contents

Abstract
Chapter I. Introduction
Chapter II. Fundamentals of Optimization
  2.1 Optimization General Structure
  2.2 The Lagrangian Function
    2.2.1 Example with Lagrangian Function
    2.2.2 Example with half-disk region
  2.3 Tangent Cones [TΩ(x*)]
  2.4 Feasible Sequence
    2.4.1 Linearized Feasible Directions F(x)
    2.4.2 Example with Tangent Cone ≠ Linearized Feasible Set
    2.4.3 Example with F(x) = TΩ(x)
  2.5 LICQ - Linear Independence Constraint Qualification
  2.6 Karush-Kuhn-Tucker (KKT) conditions
    2.6.5 Second Order Conditions
    2.6.7 Example on Critical Cones
  2.7 Duality
    2.7.1 Duality example
Chapter III. Quadratic Programming
  3.1 Equality-constrained QPs
  3.2 Karush-Kuhn-Tucker (KKT) matrix
  3.3 Solving QP with a Modified Simplex Method
Chapter IV. SVM Mechanics
  4.1 General overview
  4.2 Structural Risk Minimization (SRM)
  4.3 The Loss Function
  4.4 ε-SVR
  4.5 Standard Method for SVR QP Solutions
  4.6 The Decomposition Method
  4.7 The Kernel Function
  4.8 Cross Validation and Grid-search
Chapter V. Empirical Analysis and Findings
  5.1 About Financial Time Series
  5.2 Analysis Set-up
    5.2.1 General idea
    5.2.2 Training Output Selection (Y_{i+1})
    5.2.3 Training Input Selection (X_i)
    5.2.4 Testing Variables
    5.2.5 Error Analysis
    5.2.6 SVR Adaptive Training Strategy
    5.2.7 Applied Software
  5.3 Empirical Findings
    5.3.1 Cross Correlation Analysis
    5.3.2 Normality Tests
    5.3.3 Error Distributions
    5.3.4 The Gold Connection
  5.4 Test Conclusions
Reference

Abstract
This study explores Support Vector Machines (SVMs) for the purpose of forecasting the
ASX200 index. SVMs have become popular within the financial forecasting realm for their
distinguishing feature, Structural Risk Minimization (SRM).

The paper commences with a review of relevant numerical optimization concepts. Moving
from basic notions of Lagrangian functions to quadratic programming, it lays the foundations
the reader needs to comprehend the key ideas of SVR.

SVR details follow. This section explores SVR's core analytical and SRM processes. Key
theories are explained with examples.

The final section presents an empirical test in which SVR attempts to value the ASX200 stock
index and forecast the next-day return. The test applies roughly 7 years of daily closing prices of
several predictive variables for model building. The best test results follow an adaptive
training approach. Test findings, along with SRM effectiveness, are analysed thereafter.

Disclaimer

Rocko Chen retains all rights to the above content in perpetuity. Chen grants others permission
to copy and re-use the document for non-commercial purposes provided that Chen is given credit.
Further, the content remains primarily a subject of research, and the author does not take
liability for any potential losses arising from its use. April 28, 2009

Chapter I. Introduction

Developed by Vladimir Vapnik [15], SVMs are distinguished from other popular machine learning
algorithms by several unique characteristics. They offer robust classification/prediction
while maintaining structural risk minimization.

Some interesting accuracy comparisons [14]:

Application     #training data   #testing data   #features   #classes   Accuracy by users   Accuracy via SVM
Astroparticle        3,089            4,000            4          2            75.20%             96.90%
Vehicle              1,243               41           21          2             4.88%             87.80%

Apparently, the machine has learned substantially more efficiently than the human users.
Even within the machine learning community, SVMs have displayed superiority over methods
such as Discriminant Analysis (linear or quadratic), Logit Models, Back Propagation Neural
Networks and others [13]. These qualities make SVR desirable for financial
forecasting.

Though unlikely to be random according to growing academic evidence, the financial markets
represent high-noise, non-linear, and non-stationary processes. Throughout the years many
researchers have attempted to make predictions with various statistical methods, including
SVR, with some promise. However, some fundamental assumptions have limited the applicability
of those studies.

The empirical study examines SVR's assumption-based weaknesses and attempts to remedy
them via data selection strategies. The consequent findings point to some interesting elements.
On the whole, SVR could potentially contribute significant value for the professional
financial instrument trader.

Chapter II. Fundamentals of Optimization


2.1. Optimization General Structure
Given f(x), the objective function,
    min f(x), x ∈ R^n
    subject to c_i(x) = 0, if i ∈ E (equality constraints)
               c_i(x) ≥ 0, if i ∈ I (inequality constraints)
Feasible set Ω = set of x that satisfy the constraints.

A local solution x* is a vector within a neighbourhood N, inside Ω, such that
f(x) ≥ f(x*) for all x ∈ N ∩ Ω.

The Active Set A(x) at any feasible x consists of the equality constraint indices together
with the inequality constraint indices i for which c_i(x) = 0,
i.e. A(x) = E ∪ {i ∈ I | c_i(x) = 0}
e.g. at a feasible point x, (i ∈ I) is active if c_i(x) = 0
                            (i ∈ I) is inactive if c_i(x) > 0

2.2 The Lagrangian Function
L(x, λ_1) = f(x) − λ_1 c_1(x), where λ_1 is the Lagrange multiplier.
Then ∇_x L(x, λ_1) = ∇f(x) − λ_1 ∇c_1(x), and at a solution ∇f(x*) = λ_1 ∇c_1(x*).

2.2.1 Example with Lagrangian Function
min x_1 + x_2, S.T. (such that) 8 − x_1² − x_2² ≥ 0
Feasible region: interior and border of the circle x_1² + x_2² = 8
Constraint normal: ∇c_1(x) points toward the interior at boundary points
Solution: obviously at (−2, −2)^T

Recalling ∇_x L(x, λ_1) = ∇f(x) − λ_1 ∇c_1(x),

we have ∇_x L(x, λ_1) = ∇(x_1 + x_2) − λ_1 ∇(8 − x_1² − x_2²) = (1 + 2λ_1 x_1, 1 + 2λ_1 x_2)^T.

Setting this to zero at the solution x* = (−2, −2)^T gives 1 − 4λ_1 = 0, so
λ_1* = 0.25; the Lagrange multiplier plays a significant role in this inequality-constrained
problem.
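As a quick check of the example (this is a sketch only; SymPy was not part of the original toolchain), the stationarity conditions of the Lagrangian together with the active constraint can be solved symbolically:

    # Symbolic check of Example 2.2.1: solve grad_x L = 0 together with c1(x) = 0.
    import sympy as sp

    x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
    f = x1 + x2
    c1 = 8 - x1**2 - x2**2
    L = f - lam * c1                     # L(x, lambda) = f(x) - lambda * c1(x)

    sols = sp.solve([sp.diff(L, x1), sp.diff(L, x2), c1], [x1, x2, lam], dict=True)
    kkt = [s for s in sols if s[lam] >= 0]   # keep the multiplier with the required sign
    print(kkt)                               # [{x1: -2, x2: -2, lam: 1/4}]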

*** A feasible point x is NOT optimal if we can find a small step s that both retains feasibility
and decreases the objective function f to first order.

s retains feasibility if
0 ≤ c_1(x + s) ≈ c_1(x) + ∇c_1(x)^T s,
so to first order, feasibility is retained if
c_1(x) + ∇c_1(x)^T s ≥ 0.

Case I

x lies strictly inside the circle: c_1(x) > 0.

Any step vector s with sufficiently small length satisfies c_1(x) + ∇c_1(x)^T s ≥ 0.
In fact, whenever ∇f(x) ≠ 0, we can get a step s that satisfies
∇f(x)^T s < 0 and c_1(x) + ∇c_1(x)^T s ≥ 0 by setting

s = −α ∇f(x), with α a sufficiently small positive scalar.

However, when ∇f(x) = 0, no such step s exists.

Case II
x lies on the boundary of the circle, c_1(x) = 0; then ∇f(x)^T s < 0 and c_1(x) + ∇c_1(x)^T s ≥ 0 become
∇f(x)^T s < 0 (open half space) and ∇c_1(x)^T s ≥ 0 (closed half space).

The intersection of these two regions is empty only when ∇f(x) and ∇c_1(x) point in the same
direction, i.e. ∇f(x) = λ_1 ∇c_1(x), λ_1 ≥ 0.

Notice that the sign of λ_1 is significant. If ∇f(x*) = λ_1 ∇c_1(x*) held with a negative λ_1, then
∇f(x) and ∇c_1(x) would point in opposite directions.

Optimality summarized for Cases I and II WRT (with respect to) L(x, λ_1):
when no first-order feasible descent direction exists at x*,
∇_x L(x*, λ_1*) = 0, for some λ_1* ≥ 0,
with the required condition λ_1* c_1(x*) = 0.
This complementarity condition implies that λ_1 can be strictly positive only when c_1 is active.

2.2.2 Example with half-disk region
f(x): min x_1 + x_2, S.T. c_1: 8 − x_1² − x_2² ≥ 0, c_2: x_2 ≥ 0,
so the feasible region is the upper half-disk of x_1² + x_2² ≤ 8.
The solution sits clearly at x* = (−√8, 0)^T.

We expect a direction d of first-order feasible descent to satisfy

∇c_i(x)^T d ≥ 0, i ∈ I = {1, 2}, and ∇f(x)^T d < 0.

However, no such direction can exist at

x = (−√8, 0)^T.

The conditions ∇c_i(x)^T d ≥ 0, i = 1, 2 are both satisfied only if d lies in the quadrant
defined by ∇c_1(x) and ∇c_2(x), but every d in this quadrant gives ∇f(x)^T d ≥ 0.

Lagrangian for f(x):

We add a term λ_i c_i(x) for each additional constraint, so
L(x, λ) = f(x) − λ_1 c_1(x) − λ_2 c_2(x), and λ = (λ_1, λ_2)^T becomes the vector of Lagrange
multipliers.

The extension of [∇_x L(x*, λ_1*) = 0, for some λ_1* ≥ 0, with λ_1* c_1(x*) = 0] to two
constraints is

∇_x L(x*, λ*) = 0, for some λ* ≥ 0.

The inequality λ* ≥ 0 means all components of λ* are non-negative.

Then we apply the complementarity condition (λ_i can be strictly positive only when c_i is active)
to both inequality constraints:
λ_1* c_1(x*) = 0, λ_2* c_2(x*) = 0.

When x* = (−√8, 0)^T, we have

∇f(x*) = (1, 1)^T, ∇c_1(x*) = (2√8, 0)^T, ∇c_2(x*) = (0, 1)^T,

so ∇_x L(x*, λ*) = 0 when we select

λ* = (1/(2√8), 1)^T; notice how both components are positive.

Some feasible points are NOT solutions; let us examine the Lagrangian and its gradient at such
points.

At x = (√8, 0)^T, both constraints are active, and d = (−1, 0)^T is a first-order feasible descent
direction, so x is not a solution. For this x, the ∇_x L(x, λ) = 0 condition is only satisfied when
λ = (−1/(2√8), 1)^T, and since the first component λ_1 is negative, the condition
[∇_x L(x*, λ*) = 0, for some λ* ≥ 0] is not satisfied.

At x = (1, 0)^T, only c_2 is active. Any small step s away from this point will continue to satisfy
c_1(x + s) > 0, so we only need to consider the behaviour of c_2 and f to see whether s is indeed
a feasible descent step.

A direction of feasible descent d must satisfy

∇c_2(x)^T d ≥ 0, ∇f(x)^T d < 0.

Noting that ∇f(x) = (1, 1)^T and ∇c_2(x) = (0, 1)^T, we can see that d = (−0.5, 0.25)^T satisfies
both conditions and is therefore a feasible descent direction.

We now show that the optimality conditions [∇_x L(x*, λ*) = 0, for some λ* ≥ 0] and
[λ_1* c_1(x*) = 0, λ_2* c_2(x*) = 0] fail at this point.

Since c_1(x) > 0, we must have λ_1 = 0.

Then, to satisfy ∇_x L(x, λ) = 0,
we need a value of λ_2 such that ∇f(x) − λ_2 ∇c_2(x) = 0.
No such λ_2 exists, so this point fails to satisfy the optimality conditions.

2.3 Tangent Cones [TΩ(x*)]

Here we have:
Ω: the closed convex constraint set,
x* ∈ Ω,
F(x*): the set of first-order feasible directions at x*.

* The earlier approach of examining the first derivatives of f and c_i, via first-order Taylor
series expansions about x, only works when the linearized approximation adequately captures the
geometry of the feasible set near the point x in question.

If, near x, the linearization is fundamentally different from the feasible set, e.g. the
linearization is the entire plane while the feasible set is a single point, then the linear
approximation will not yield useful information.

This is where we must make assumptions about the nature of the c_i; that is,
constraint qualifications ensure the similarity of the constraint set Ω and its linearized
approximation in a neighbourhood of x*.

2.4 Feasible Sequence
[15] Given a feasible point x, we call {Z_k} a feasible sequence approaching x if Z_k ∈ Ω
for all k sufficiently large and Z_k → x.

A local solution x* is a point at which all feasible sequences approaching x* have the property
that f(Z_k) ≥ f(x*) for all k sufficiently large; we will derive conditions under which this
property holds.

A tangent is a limiting direction of a feasible sequence.

Definition 12.2
d is a tangent (vector) to Ω at a point x if there is a feasible sequence {Z_k} approaching x
and a sequence of positive scalars {t_k} with t_k → 0 such that
lim_{k→∞} (Z_k − x)/t_k = d.

The set of all tangents to Ω at x* is called the Tangent Cone, TΩ(x*).

Note: if d is a tangent vector with corresponding sequences {t_k} and {Z_k}, then replacing
each t_k by α^{−1} t_k, where α > 0, we get
αd ∈ TΩ(x*).

2.4.1 Linearized Feasible Directions F(x)
Given a feasible point x and the active constraint set A(x), the set of linearized feasible
directions F(x) is
F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
            d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I}.

We can see that F(x) is a cone.

Note: the definition of the tangent cone does not rely on the algebraic specification of Ω, only
on its geometry. The linearized feasible direction set, however, depends on the definition of the
constraint functions c_i, i ∈ E ∪ I.

2.4.2 Example with Tangent Cone ≠ Linearized Feasible Set
min f(x) = x_1 + x_2, such that c_1: x_1² + x_2² − 2 = 0.
The constraint is a circle of radius √2, and we work near the non-optimal point
x = (−√2, 0)^T.

It has a feasible sequence approaching x, defined by

Z_k = (−√(2 − 1/k²), −1/k)^T, with t_k = ||Z_k − x||,

and d = (0, −1)^T is a tangent.

Note: f increases as we move along the sequence, f(Z_{k+1}) > f(Z_k) for all k = 2, 3, …
Since f(Z_k) < f(x) for k = 2, 3, …, x cannot be a solution.

Another feasible sequence approaches x = (−√2, 0)^T from the opposite direction, defined by

Z_k = (−√(2 − 1/k²), 1/k)^T.

We can see that f decreases along this sequence, and the tangents along this sequence are
d = (0, α)^T, α > 0.
The tangent cone at x = (−√2, 0)^T is therefore {(0, d_2)^T | d_2 ∈ ℝ}.

Via the definition
F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
            d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I},

d = (d_1, d_2)^T ∈ F(x) if
0 = ∇c_1(x)^T d = (2x_1, 2x_2)(d_1, d_2)^T = −2√2 d_1.

Therefore F(x) = {(0, d_2)^T | d_2 ∈ ℝ}, and so TΩ(x) = F(x).


Suppose the feasible set is defined instead by
Ω = {x | c_1(x) = 0}, where c_1(x) = (x_1² + x_2² − 2)² = 0.

Note: the algebraic specification of Ω has changed, but the set itself has not. The vector d
belongs to the linearized feasible set if

0 = ∇c_1(x)^T d = (4(x_1² + x_2² − 2)x_1, 4(x_1² + x_2² − 2)x_2)(d_1, d_2)^T = (0, 0)(d_1, d_2)^T = 0.

This holds for all (d_1, d_2)^T, so F(x) = ℝ², and for this specification of Ω the
tangent cone and the linearized feasible set differ.

2.4.3 Example with F(x) = TΩ(x)
min x_1 + x_2, S.T. 2 − x_1² − x_2² ≥ 0.
Feasible region: on and within the circle of radius √2.

The solution is clearly x = (−1, −1)^T, i.e. the same as in the equality-constrained case, but
this time we have many feasible sequences converging to any given feasible point.
E.g. from x = (−√2, 0)^T, the feasible sequences defined for the equality-constrained
problem are still feasible for this problem. Infinitely many feasible sequences also converge to
x = (−√2, 0)^T along a straight line from the interior of the circle, with the form

Z_k = (−√2, 0)^T + (1/k)w,

where w is a vector whose first component is positive (w_1 > 0). Z_k remains feasible if
||Z_k|| ≤ √2, which is true when k ≥ (w_1² + w_2²)/(2√2 w_1).

We could also have an infinite variety of sequences approaching (−√2, 0)^T along a curve
from the circle's interior.

The tangent cone of this set at (−√2, 0)^T is {(w_1, w_2)^T | w_1 ≥ 0}.

Via the definition of F(x),
d ∈ F(x) if
0 ≤ ∇c_1(x)^T d = (−2x_1, −2x_2)(d_1, d_2)^T = 2√2 d_1,

so F(x) = {d | d_1 ≥ 0} and we have F(x) = TΩ(x) for this specification of the feasible set.

2.5 LICQ - Linear Independence Constraint Qualification
[15]
The LICQ holds at the point x, with active set A(x), if the set of active constraint gradients
{∇c_i(x), i ∈ A(x)} is linearly independent.

2.5.1 Constraint Qualifications

Constraint qualifications ensure the similarity of the constraint set Ω and its linearized
approximation in a neighbourhood of x*.

Recall that, given a feasible point x, {Z_k} is a feasible sequence approaching x if Z_k ∈ Ω
for all k sufficiently large and Z_k → x, and that a local solution x* is a point at which all
feasible sequences approaching x* satisfy f(Z_k) ≥ f(x*) for all k sufficiently large.

First Order Optimality Conditions

Given the Lagrangian function
L(x, λ) = f(x) − Σ_{i ∈ E ∪ I} λ_i c_i(x),

if
1. x* is a local solution,
2. the functions f and c_i are continuously differentiable, and
3. LICQ holds at x*,

then there exists a Lagrange multiplier vector λ* with components λ_i*, i ∈ E ∪ I,
such that the following conditions are satisfied at (x*, λ*).

This leads to

2.6 Karush-Kuhn-Tucker (KKT) conditions
[15]
∇_x L(x*, λ*) = 0
c_i(x*) = 0 for all i ∈ E
c_i(x*) ≥ 0 for all i ∈ I
λ_i* ≥ 0 for all i ∈ I
λ_i* c_i(x*) = 0 for all i ∈ E ∪ I

For the last condition, λ_i* c_i(x*) = 0, the relationship can be

complementary, i.e. either or both of λ_i* and c_i(x*) could be 0,
or
strictly complementary: exactly one of λ_i* and c_i(x*) is 0.

Some additional properties follow. They may seem obvious, but they are worth stating to clear
potential ambiguities as we move on.

2.6.2 Lemma 12.2

Let x* be a feasible point; then the following are true:
i. TΩ(x*) ⊂ F(x*);
ii. if LICQ is satisfied at x*, then F(x*) = TΩ(x*).

2.6.3 A Fundamental Necessary Condition

At a local solution x*, feasible sequences {Z_k} have the property
f(Z_k) ≥ f(x*) for all k sufficiently large.

If x* is a local solution, then

∇f(x*)^T d ≥ 0, for all d ∈ TΩ(x*).

2.6.4 Lemma
Let the cone K be defined as K = {By + Cw | y ≥ 0}.
Given any vector g ∈ R^n, we have either
g ∈ K,
or
there is a d ∈ R^n satisfying g^T d < 0, B^T d ≥ 0, C^T d = 0, but not both.

2.6.5 Second Order Conditions
The KKT conditions tell us how the first derivatives of f and the active constraints c_i are
related at a solution x*. When these conditions are satisfied, a move along any vector w from
F(x*) either
1. increases the first-order approximation to the objective function, i.e. w^T ∇f(x*) > 0,
2. or keeps it at the same value, i.e. w^T ∇f(x*) = 0.

The second derivatives of f and the constraints c_i play the "tiebreaking" role. For directions
w ∈ F(x*) with w^T ∇f(x*) = 0, we cannot tell from first derivatives alone whether a move along w
will increase or decrease f. The second-order conditions examine the second-derivative terms in
the Taylor series expansions of f and c_i to resolve this.

Essentially, the second-order conditions concern the curvature of the Lagrangian function in
these "undecided" directions, i.e. the directions w ∈ F(x*) with w^T ∇f(x*) = 0.

Given a solution x* at which the KKT conditions are met:

the inequality constraint c_i is strongly active, or binding, if
i ∈ A(x*) and λ_i* > 0 for some Lagrange multiplier λ* satisfying the KKT conditions;
c_i is weakly active if
i ∈ A(x*) and λ_i* = 0 for all λ* satisfying the KKT conditions.

2.6.6 F(x): set of linearized feasible directions

Given F(x) = {d | d^T ∇c_i(x) = 0 for all i ∈ E,
                  d^T ∇c_i(x) ≥ 0 for all i ∈ A(x) ∩ I},

and some Lagrange multiplier vector λ* satisfying the KKT conditions,

we define the Critical Cone C(x*, λ*) as follows:

C(x*, λ*) = {w ∈ F(x*) | ∇c_i(x*)^T w = 0 for all i ∈ A(x*) ∩ I with λ_i* > 0}.



Equivalently,

w ∈ C(x*, λ*)  ⇔  ∇c_i(x*)^T w = 0 for all i ∈ E,
                   ∇c_i(x*)^T w = 0 for all i ∈ A(x*) ∩ I with λ_i* > 0,
                   ∇c_i(x*)^T w ≥ 0 for all i ∈ A(x*) ∩ I with λ_i* = 0.

The critical cone contains the directions w that tend to adhere to the active inequality
constraints (those indices i ∈ I where λ_i* is positive), even when we make small changes to the
objective or to the equality constraints.

From the above, and since λ_i* = 0 for all inactive components i ∈ I \ A(x*), it follows that

w ∈ C(x*, λ*)  ⟹  λ_i* ∇c_i(x*)^T w = 0 for all i ∈ E ∪ I.   (12.54)

From the first KKT condition, ∇_x L(x*, λ*) = 0, and the Lagrangian definition
L(x, λ) = f(x) − Σ_{i ∈ E ∪ I} λ_i c_i(x),

we have w ∈ C(x*, λ*)  ⟹  w^T ∇f(x*) = Σ_{i ∈ E ∪ I} λ_i* w^T ∇c_i(x*) = 0.

Hence the critical cone C(x*, λ*) contains the directions from F(x*) for which the first
derivative does not clearly state whether f will increase or decrease.

2.6.7 Example on Critical Cones
min x_1, S.T. c_1: x_2 ≥ 0, c_2: 1 − (x_1 − 1)² − x_2² ≥ 0.
The feasible region is the interior and border of a circle of radius 1 centred at (1, 0).
Clearly, x* = (0, 0)^T,
and its active set is A(x*) = {1, 2}.

The gradients of the active constraints at x* are
∇c_1(x*) = (0, 1)^T and ∇c_2(x*) = (2, 0)^T,
and ∇f(x*) = (1, 0)^T, so the optimal Lagrange multiplier is λ* = (0, 0.5)^T, satisfying
∇f(x*) = λ_1* ∇c_1(x*) + λ_2* ∇c_2(x*).
As LICQ holds, the optimal multiplier is unique.

We also have the linearized feasible set F(x*) = {d | d ≥ 0},

and the critical cone C(x*, λ*) = {(0, w_2)^T | w_2 ≥ 0}.

2.7 Duality
[15]
Given
min f(x), subject to c(x) ≥ 0,
with the Lagrangian function L(x, λ) = f(x) − λ^T c(x),
where the Lagrange multiplier vector λ ∈ R^m,

the dual objective function q: R^m → R is

q(λ) := inf_x L(x, λ).

The domain of q is the set of λ values where q is finite,

D := {λ | q(λ) > −∞}.

2.7.1 Duality example
min 0.5(x_1² + x_2²) subject to x_1 − 1 ≥ 0.
Lagrangian: L(x_1, x_2, λ_1) = 0.5(x_1² + x_2²) − λ_1(x_1 − 1).
With λ_1 fixed, this is a convex function of (x_1, x_2)^T, so the infimum is achieved where the
partial derivatives with respect to x_1 and x_2 are zero, i.e. x_1 − λ_1 = 0 and x_2 = 0.

Substituting these infimal values into L(x_1, x_2, λ_1), we get the dual objective
q(λ_1) = 0.5(λ_1² + 0) − λ_1(λ_1 − 1) = −0.5λ_1² + λ_1.

So the dual problem, max q(λ) subject to λ ≥ 0, becomes

max_{λ_1 ≥ 0} −0.5λ_1² + λ_1,

with the apparent solution λ_1 = 1.
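The same infimum-then-maximize calculation can be reproduced symbolically; a minimal sketch (SymPy, illustrative only):

    # Dual objective for min 0.5*(x1^2 + x2^2) s.t. x1 - 1 >= 0.
    import sympy as sp

    x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
    L = sp.Rational(1, 2) * (x1**2 + x2**2) - lam * (x1 - 1)

    # For fixed lambda the Lagrangian is convex in x, so the infimum is at the stationary point.
    stat = sp.solve([sp.diff(L, x1), sp.diff(L, x2)], [x1, x2])  # {x1: lam, x2: 0}
    q = sp.expand(L.subs(stat))                                  # -lam**2/2 + lam
    lam_star = sp.solve(sp.Eq(sp.diff(q, lam), 0), lam)          # [1]
    print(q, lam_star)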

Chapter III. Quadratic Programming
[15] General QP format:

min_x q(x) = 0.5 x^T G x + x^T c

subject to a_i^T x = b_i, i ∈ E,
           a_i^T x ≥ b_i, i ∈ I.

G: a symmetric n by n matrix.
E and I are finite sets of indices.
c, x, and the a_i, i ∈ E ∪ I, are vectors in R^n.

If G is
positive semidefinite, the problem is a convex QP;
positive definite, it is a strictly convex QP;
indefinite, it is nonconvex and more challenging due to multiple stationary points and local
minima.

3.1 Equality-constrained QPs
Form:
min_x q(x) := 0.5 x^T G x + x^T c
subject to Ax = b.

A: the m by n Jacobian of the constraints (where m ≤ n);
its rows are a_i^T, i ∈ E.
b: a vector in R^m with components b_i, i ∈ E.

First-order necessary conditions for a solution x*:
there is a vector λ* (Lagrange multipliers) satisfying the system

[G  −A^T] [x*]   [−c]
[A    0 ] [λ*] = [ b].

Expressing x* = x + p (for computation), with
x: the current solution estimate,
p: the desired step,

we now get the

3.2 Karush-Kuhn-Tucker (KKT) matrix

[G  A^T] [−p]   [g]
[A   0 ] [λ*] = [h],

where h = Ax − b, g = c + Gx, p = x* − x.

Note:
Z denotes the n by (n − m) matrix whose columns are a basis for the null space of A,
i.e. AZ = 0.

KKT QP Example

min q(x) = 3x_1² + 2x_1x_2 + x_1x_3 + 2.5x_2² + 2x_2x_3 + 2x_3² − 8x_1 − 3x_2 − 3x_3,

subject to
x_1 + x_3 = 3,
x_2 + x_3 = 0.   (16.9)

G comes from the second derivatives of q:
∂q/∂x_1 = 6x_1 + 2x_2 + x_3
∂q/∂x_2 = 2x_1 + 5x_2 + 2x_3
∂q/∂x_3 = x_1 + 2x_2 + 4x_3

c collects the linear terms, A the constraint gradients, and b the constraint right-hand sides.

In matrix form,

G = [6 2 1; 2 5 2; 1 2 4],  c = (−8, −3, −3)^T,  A = [1 0 1; 0 1 1],  b = (3, 0)^T.

Using
[G  −A^T] [x*]   [−c]
[A    0 ] [λ*] = [ b],

we get

[6 2 1 −1  0]          [8]
[2 5 2  0 −1] [ x* ]   [3]
[1 2 4 −1 −1] [ λ* ] = [3]
[1 0 1  0  0]          [3]
[0 1 1  0  0]          [0]

with solution x* = (2, −1, 1)^T and λ* = (3, −2)^T.

A null-space basis for A is Z = (−1, −1, 1)^T, i.e. AZ = 0.
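The 5 by 5 KKT system above can also be solved numerically; a small sketch (NumPy), useful for verifying x* and λ*:

    # Solve [[G, -A^T], [A, 0]] [x*; lambda*] = [-c; b] for the example above.
    import numpy as np

    G = np.array([[6., 2., 1.],
                  [2., 5., 2.],
                  [1., 2., 4.]])
    c = np.array([-8., -3., -3.])
    A = np.array([[1., 0., 1.],
                  [0., 1., 1.]])
    b = np.array([3., 0.])

    K = np.block([[G, -A.T], [A, np.zeros((2, 2))]])
    rhs = np.concatenate([-c, b])
    sol = np.linalg.solve(K, rhs)
    print(sol[:3])   # x*      -> [ 2. -1.  1.]
    print(sol[3:])   # lambda* -> [ 3. -2.]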

Many methods exist to solve QPs. From this point the paper focuses on one approach that is
popular in contemporary research.

3.3 Solving QP with a Modified Simplex Method
Jenson and Bard start with an examination of the Karush-Kuhn-Tucker conditions, which leads to
a set of linear equalities and complementarity constraints [10]. They then apply a modified
simplex algorithm for solutions.

General QP
min f(x) = cx + ½ x^T Q x
subject to Ax ≤ b and x ≥ 0,

where
c: an n-dimensional row vector of coefficients of the linear terms in the objective function;
Q: an (n by n) symmetric matrix describing the coefficients of the quadratic terms;
x: an n-dimensional column vector of decision variables (as in linear programming);
A: an (m by n) matrix defining the constraints;
b: the column vector of right-hand-side coefficients of the constraints.

The model drops constants in the objective. We assume a feasible solution exists and that the
constraint region is bounded. A global minimum exists when f(x) is strictly convex for all
feasible points.

KKT Conditions
A positive definite Q guarantees strict convexity.

Excluding the non-negativity conditions, the Lagrangian function for the QP is

L(x, μ) = cx + ½ x^T Q x + μ(Ax − b),

where μ is an m-dimensional row vector.

The KKT conditions for a local minimum follow:

∂L/∂x_j ≥ 0, j = 1,…,n        c + x^T Q + μA ≥ 0                  (12a)
∂L/∂μ_i ≤ 0, i = 1,…,m        Ax − b ≤ 0                          (12b)
x_j ∂L/∂x_j = 0, j = 1,…,n    x^T(c^T + Qx + A^T μ^T) = 0         (12c)
μ_i g_i(x) = 0, i = 1,…,m     μ(Ax − b) = 0                       (12d)
x_j ≥ 0, j = 1,…,n            x ≥ 0                               (12e)
μ_i ≥ 0, i = 1,…,m            μ ≥ 0                               (12f)

To make these conditions more manageable, they introduce nonnegative surplus variables
y ∈ R^n into (12a) and nonnegative slack variables v ∈ R^m into (12b):

c^T + Qx + A^T μ^T − y = 0 and Ax − b + v = 0.

The KKT conditions can now be rewritten with the constants moved to the right-hand side:

Qx + A^T μ^T − y = −c^T       (13a)
Ax + v = b                    (13b)
x ≥ 0, μ ≥ 0, y ≥ 0, v ≥ 0    (13c)
y^T x = 0, μv = 0             (13d)

The first two expressions are linear equalities, the third restricts all variables to be
nonnegative, and the fourth imposes complementary slackness.

They then apply a simplex algorithm to solve (13a)-(13d), with a "restricted basis entry rule"
to treat the complementary slackness conditions (13d) implicitly. It takes the following steps:
• Let the structural constraints be defined by the KKT conditions (13a) and (13b).
• If any right-hand-side value is negative, multiply the corresponding equation by −1.
• Add an artificial variable to each equation.
• The objective function becomes the sum of the artificial variables.
• Put the resulting problem into simplex form.

The aim is to find a solution to the linear program that minimizes the sum of the artificial
variables, with the additional requirement that the complementary slackness conditions are
satisfied at each iteration. If the sum becomes zero, the solution satisfies (13a)-(13d). To
accommodate (13d), the rule for selecting entering variables must respect the following
relationships:

x_j and y_j are complementary for j = 1,…,n;
μ_i and v_i are complementary for i = 1,…,m.

The entering variable is the one with the most negative reduced cost, provided that its
complementary variable is not in the basis or would leave the basis on the same iteration. The
algorithm ultimately provides the vector x as the optimal solution and the vector μ as the
optimal dual variables.

Note:
According to the authors, this approach works well only when the objective function is
positive definite, while requiring computational effort comparable to a linear programming
problem with m + n constraints (where m is the number of constraints and n the number of QP
variables).

Positive semidefinite forms of the objective function can present computational complications.
A suggested remedy for semidefiniteness is to add a small constant to each of the diagonal
elements of Q so that the matrix becomes positive definite. Although the solution will then not
be exact, the difference remains insignificant as long as the modification is kept relatively
small.

An example with the Simplex Method

Minimize f(x) = −8x_1 − 16x_2 + x_1² + 4x_2²
subject to x_1 + x_2 ≤ 5,
           x_1 ≤ 3,
           x_1 ≥ 0, x_2 ≥ 0.

First we rewrite the problem in matrix form:

c^T = (−8, −16)^T,  Q = [2 0; 0 8],  A = [1 1; 1 0],  b = (5, 3)^T.

We can see that Q is positive definite, so the KKT conditions are necessary and sufficient for a
global optimum.

The linear constraints (13a) and (13b) take the following form:

2x_1 + μ_1 + μ_2 − y_1 = 8
8x_2 + μ_1 − y_2 = 16
x_1 + x_2 + v_1 = 5
x_1 + v_2 = 3

They then add an artificial variable to each constraint and minimize their sum:

minimize a_1 + a_2 + a_3 + a_4
subject to
2x_1 + μ_1 + μ_2 − y_1 + a_1 = 8
8x_2 + μ_1 − y_2 + a_2 = 16
x_1 + x_2 + v_1 + a_3 = 5
x_1 + v_2 + a_4 = 3

Note: all variables are ≥ 0 and the complementarity conditions apply.

A modified simplex technique then yields the following iterations:

Iteration   Basic Variables        Solution       Objective Value   Entering   Leaving
1           a_1, a_2, a_3, a_4     8, 16, 5, 3    32                x_2        a_2
2           a_1, x_2, a_3, a_4     8, 2, 3, 3     14                x_1        a_3
3           a_1, x_2, x_1, a_4     2, 2, 3, 0     2                 μ_1        a_4
4           a_1, x_2, x_1, μ_1     2, 2, 3, 0     2                 μ_2        a_1
5           μ_2, x_2, x_1, μ_1     2, 2, 3, 0     0

And the optimal solution: (x_1*, x_2*) = (3, 2)
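The result can be cross-checked with a general-purpose solver; a sketch (SciPy, not the authors' method) that should reproduce (3, 2):

    # Verify the QP example: min -8*x1 - 16*x2 + x1^2 + 4*x2^2
    # subject to x1 + x2 <= 5, x1 <= 3, x >= 0.
    import numpy as np
    from scipy.optimize import minimize

    Q = np.array([[2., 0.], [0., 8.]])
    c = np.array([-8., -16.])

    def f(x):
        return c @ x + 0.5 * x @ Q @ x

    cons = [{'type': 'ineq', 'fun': lambda x: 5. - x[0] - x[1]},
            {'type': 'ineq', 'fun': lambda x: 3. - x[0]}]
    res = minimize(f, x0=np.zeros(2), method='SLSQP',
                   bounds=[(0., None), (0., None)], constraints=cons)
    print(res.x)   # approximately [3. 2.]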



Chapter IV. SVM Mechanics
SVMs map data nonlinearly into a high (potentially infinite) dimensional feature space via a
kernel function, where the data become easily separable by a linear hyperplane [14]. The
distance (margin) between the closest points (the support vectors) and the hyperplane is then
maximized. This yields nonlinear regressions in the low-dimensional input space, where
extrapolations of improved accuracy become possible, i.e. classification or forecasting.

4.1 General overview
We commence with a data set G = {(X_i, y_i)}_{i=1}^N of N data points, where X_i denotes the
input (predictive variables) and y_i the corresponding output (response variable). ε-SVR then
works out a function f(x) that deviates by at most ε from the actually obtained y_i for all the
training data, while keeping f(x) as flat as possible. We therefore end up with an "expected"
range for extrapolation errors, making the findings more practical. [9]

X_i is mapped nonlinearly into the D-dimensional feature space F, with corresponding output y_i.
We then do linear regression in F: we need to find a function f̂(t+1) = ĝ(X) that approximates
y(t+1) based on G.

SVR approximation function:

ĝ(X) = Σ_{i=1}^D ω_i φ_i(X) + b, where φ: R^n → F, ω ∈ F.

ω_i: coefficients
b: threshold value

A dot product takes place between ω and φ(X); ĝ(X) defines a hyperplane in F, whose
dimensionality can be very high, possibly infinite.

A small ω leads to improved flatness. One way to find a small ω is to minimize the norm,
i.e. ||ω||² = <ω, ω>.

The SVR QP
The SVR becomes a minimization problem with slack variables ξ and ξ*:

minimize ½||ω||² + C Σ_{i=1}^N (ξ_i + ξ_i*)   [16]

subject to ĝ(X_i) + b − y_i ≤ ε + ξ_i,
           y_i − [ĝ(X_i) + b] ≤ ε + ξ_i*,
           ξ_i, ξ_i* ≥ 0.

This is solved by forming the Lagrangian and transforming it into a dual problem, giving

f̂(t+1) = ĝ(X) = Σ_{i=1}^N (α_i* − α_i) K(X, X_i) + b,

where α_i*, α_i are the Lagrange multipliers associated with X_i,
associated with X i ,

0 ≤ α_i*, α_i ≤ C,
Σ_{i=1}^N (α_i* − α_i) = 0.

Training points with nonzero Lagrange multipliers are called Support Vectors (SVs); the smaller
the fraction of SVs, the more general the solution.

The coefficients α_i*, α_i are obtained by maximizing the following form subject to the
conditions stated above:

R(α*, α) = Σ_{i=1}^N y_i(α_i* − α_i) − ε Σ_{i=1}^N (α_i* + α_i)
           − ½ Σ_{i,j=1}^N (α_i* − α_i)(α_j* − α_j) K(X_i, X_j).

The training vectors X_i are mapped into a higher (potentially infinite) dimensional space by
the mapping function Φ. The SVM then finds a linear hyperplane separating the (support)
vectors in this space with maximal margin (distance to the hyperplane). [12]
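For orientation, the ε-SVR described above is available in standard libraries; a minimal usage sketch (scikit-learn, with toy data standing in for the study's data set):

    # Minimal epsilon-SVR usage sketch; the parameter values are illustrative only.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))                # inputs X_i
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)   # noisy outputs y_i

    model = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=0.5)
    model.fit(X, y)

    print(len(model.support_))     # number of support vectors (nonzero multipliers)
    print(model.predict([[0.5]]))  # close to sin(0.5)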

4.2 Structural Risk Minimization (SRM)
Popular financial time series models tend to "over-fit": they focus so exclusively on the quality
of the fit to the training data (the empirical error) that the structural risk (potential forecast
error) becomes perilously volatile.

SVMs offer uniqueness as a learning algorithm by applying SRM: they utilize capacity control of
the decision function, the kernel function, and the sparsity of the solution (Huang, Nakamori,
Wang 2005). The SRM principle helps SVMs estimate functions while minimizing generalization
error, making SVM classification and SVR highly resistant to over-fitting.

Utilized Risk Function [17]

Say the SVR aims to estimate/learn a function f(x, λ), where X is the input space (e.g. stock
index prices together with econometric indicators) and λ ∈ Λ is a set of abstract parameters,
from an independent identically distributed (i.i.d.) sample set of size N,
(x_1, y_1),…,(x_N, y_N), x_i ∈ X ⊂ R^d, y_i ∈ R.   (1.1)
The training data (x_i, y_i) come from an unknown distribution P(x, y).

We look for the function f(x, λ*) with the smallest possible value of the expected risk (or
extrapolation error)

R[λ] = ∫ l[y, f(x, λ)] P(x, y) dx dy,   (1.2)

where l is a loss function, to be defined as needed.

With the probability distribution P(x, y) in (1.2) unknown, computing and minimizing R[λ]
directly is impossible. However, we do have some information about P(x, y) from the i.i.d.
sample (1.1), so it becomes possible to compute a stochastic approximation of R[λ], the
Empirical Risk:

R_emp[λ] = (1/N) Σ_{i=1}^N l[y_i, f(x_i, λ)].   (1.3)

According to Yang, the law of large numbers makes the empirical risk converge to its statistical
expectation. Despite this, for small sample sets, minimizing the empirical risk alone can lead to
problems such as loss of accuracy or over-fitting.

To remedy the issue of relatively insufficient sample size, statistical learning, or VC
(Vapnik-Chervonenkis), theory offers bounds on the deviation of the empirical risk from the
expected risk. The standard form of the Vapnik and Chervonenkis bound, holding with probability
1 − η, is

R[λ] ≤ R_emp[λ] + √( [h(ln(2N/h) + 1) − ln(η/4)] / N ),   ∀ λ ∈ Λ,   (1.4)

where h is the VC-dimension of f(x, λ).

From (1.4) it is apparent that to achieve a small expected risk, i.e. improved accuracy, both the
empirical risk and the ratio between the VC-dimension and the number of data points must stay
relatively small. As the empirical risk is usually a decreasing function of h, an optimal value of
the VC-dimension exists for a given number of samples. Therefore, when given a relatively limited
number of data points, a well-chosen value of h (often controlled by free model parameters) is
the key to good performance.

The above led to the technique of SRM (Vapnik and Chervonenkis 1974), an attempt to choose the
most appropriate VC-dimension.

Overall, SRM is an inductive principle for optimizing the trade-off between hypothesis space
complexity and empirical error. The resulting capacity control offers minimum test error while
keeping the model as "simple" as possible.

SRM steps [11]:

• For a given domain, a function class is selected (e.g. nth-degree polynomials, n-layer neural
networks, n-rule fuzzy logic systems).

• The function classes are divided into nested subsets, ordered by complexity.

• Empirical risk minimization is carried out on each subset, i.e. general parameter selection.

SRM therefore naturally holds the potential to dramatically improve the stability of forecast
errors. As financial markets are exceptionally noise-saturated processes, consistency in
prediction errors serves as the next best thing for practical utilization.

4.3 The Loss Function
The loss function measures the empirical risk. Many types exist; some are listed below [17]:

Type                    Loss function l(δ)                          Density function ρ(δ)
Linear ε-insensitive    |δ|_ε                                       1/(2(1+ε)) exp(−|δ|_ε)
Laplacian               |δ|                                         ½ exp(−|δ|)
Gaussian                ½ δ²                                        1/√(2π) exp(−δ²/2)
Huber's robust          δ²/(2σ), if |δ| ≤ σ;                        ∝ exp(−δ²/(2σ)), if |δ| ≤ σ;
                        |δ| − σ/2, otherwise                        ∝ exp(σ/2 − |δ|), otherwise
Polynomial              (1/d)|δ|^d                                  d/(2Γ(1/d)) exp(−|δ|^d)

The target values y are generated by an underlying functional dependency f plus additive noise δ
with density ρ(δ). Minimizing R_emp then coincides with

l(f(x), y) = −log[p(y | x, f)].

An example is the square loss function, which corresponds to y being affected by Gaussian
(normal) noise:

l_2(y, f(x)) = ½ (y − f(x))²,  or  l_2(δ) = ½ δ².   (1.7)

The squared loss, however, is not always the best choice. The ε-insensitive loss function above
is also popular, and is the one applied in the ε-SVR used later in this study:

l_ε(y, f(x)) = 0, if |y − f(x)| < ε;  |y − f(x)| − ε, otherwise.   (1.8)

The ε-insensitive function registers no error as long as a data point remains within the range
±ε. Increasing ε therefore tends to reduce the number of support vectors; at the extreme it may
result in a constant regression function. In this way the loss function indirectly affects the
complexity and generalization of the model.
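The ε-insensitive loss (1.8) and the empirical risk (1.3) it induces are simple to state in code; a small sketch (NumPy):

    # epsilon-insensitive loss: zero inside the +/- epsilon tube, |y - f(x)| - epsilon outside.
    import numpy as np

    def eps_insensitive_loss(y, f_x, eps):
        return np.maximum(np.abs(y - f_x) - eps, 0.0)

    def empirical_risk(y, f_x, eps):
        # R_emp = (1/N) * sum_i l_eps(y_i, f(x_i))
        return eps_insensitive_loss(y, f_x, eps).mean()

    y   = np.array([1.0, 2.0, 3.0])
    f_x = np.array([1.05, 2.5, 2.9])
    print(eps_insensitive_loss(y, f_x, eps=0.1))   # [0.  0.4 0. ]
    print(empirical_risk(y, f_x, eps=0.1))         # approximately 0.1333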

4.4 ε-SVR
With the advantages of the ε-insensitive function, ε-SVR has become quite helpful for
regression-type data sets. It aims to find a function f with parameters w and b by minimizing the
following regression risk:

R_reg(f) = ½<w, w> + C Σ_{i=1}^N l(f(x_i), y_i),   (1.5)

where
C: the cost of error, a trade-off term;
<·,·>: the inner product; the first term can be seen as the margin, measuring the VC-dimension.
The Euclidean norm <w, w> measures the flatness of the function f, so minimizing <w, w> makes the
objective function as flat as possible:
f(x, w, b) = <w, Φ(x)> + b.   (1.6)

Minimization of (1.5) is equivalent to the following constrained minimization problem:

min Y(w, b, ξ^(*)) = ½<w, w> + C Σ_{i=1}^N (ξ_i + ξ_i*)   (1.9)

subject to y_i − (<w, Φ(x_i)> + b) ≤ ε + ξ_i,
           (<w, Φ(x_i)> + b) − y_i ≤ ε + ξ_i*,   (1.10)
           ξ_i^(*) ≥ 0,
           i = 1,…,N,

where (*) denotes both variables, with and without the asterisk, and ξ_i and ξ_i* are the up and
down errors for sample (x_i, y_i) respectively.

4.5 Standard Method for SVR QP Solutions
The standard route to the optimal solution of (1.9), with f as in (1.6) [17],

f(x, w, b) = <w, Φ(x)> + b,

involves constructing the dual of the primal minimization problem by the Lagrange method and
then maximizing the dual. The dual QP works out as

min Q(α^(*)) = ½ Σ_{i=1}^N Σ_{j=1}^N (α_i − α_i*)(α_j − α_j*)<Φ(x_i), Φ(x_j)>
               + Σ_{i=1}^N (ε − y_i) α_i + Σ_{i=1}^N (ε + y_i) α_i*   (1.11)

subject to Σ_{i=1}^N (α_i − α_i*) = 0, α_i^(*) ∈ [0, C].   (1.12)

After solving this QP, we get the objective function

f(x) = Σ_{i=1}^N (α_i − α_i*)<Φ(x_i), Φ(x)> + b,

where α and α* are Lagrange multipliers used to pull f towards y.

With the QP solved, we still need to find the value of b. The KKT conditions help here:
α_i(ε + ξ_i − y_i + <w, Φ(x_i)> + b) = 0,
α_i*(ε + ξ_i* + y_i − <w, Φ(x_i)> − b) = 0,

and
(C − α_i) ξ_i = 0,
(C − α_i*) ξ_i* = 0.

These lead to some interesting conclusions:

α_i^(*) = C means that the sample (x_i, y_i) lies outside the ε margin;
α_i α_i* = 0, so this pair is never simultaneously non-zero;
α_i^(*) ∈ (0, C) corresponds to (x_i, y_i) lying on the ε margin, which lets us find b:

b = y_i − <w, Φ(x_i)> − ε, for α_i ∈ (0, C),
b = y_i − <w, Φ(x_i)> + ε, for α_i* ∈ (0, C).

The sample points (x_i, y_i) with nonzero α_i or α_i* are the Support Vectors.

4.6 The Decomposition Method
[8]
Real-life QPs, however, do not come in nice forms and require heavy computing power to solve.
The decomposition method has become the general framework for solving SVM QPs via an iterative
process.

General form for C-SVC, ε-SVR, and one-class SVM:

min_α ½ α^T Q α + p^T α
subject to y^T α = Δ,
           0 ≤ α_t ≤ C, t = 1,…,l,   (3.1)
where y_t = ±1, t = 1,…,l (C-SVC and one-class SVM are already in this form).

For ε-SVR, we consider the reformulation

min ½ [α^T, (α*)^T] [Q −Q; −Q Q] [α; α*] + [εe^T + z^T, εe^T − z^T] [α; α*]
subject to y^T [α; α*] = 0, 0 ≤ α_t, α_t* ≤ C, t = 1,…,l,   (3.2)
where y is a 2l by 1 vector with y_t = 1 for t = 1,…,l and y_t = −1 for t = l+1,…,2l.

The method modifies a subset of α per iteration. This subset, the working set B, leads to a small
sub-problem to be minimized in each iteration; in the extreme case we solve a simple two-variable
problem per iteration.

Algorithm 1: Sequential Minimal Optimization
(SMO-type decomposition method, Fan et al. 2005)
1. Find α^1 as an initial feasible solution. Set k = 1.
2. Find a 2-element working set B = {i, j} by Working Set Selection (WSS); define
   N = {1,…,l}\B and let α_B^k and α_N^k be the sub-vectors of α^k corresponding to B and N
   respectively.
3. If a_ij := K_ii + K_jj − 2K_ij > 0,
   solve the sub-problem in the variables α_B = (α_i, α_j):
   min ½ α_B^T Q_BB α_B + (p_B + Q_BN α_N^k)^T α_B
   subject to 0 ≤ α_i, α_j ≤ C,
              y_i α_i + y_j α_j = Δ − y_N^T α_N^k.   (3.3)
   Otherwise, solve (3.3) with a convexifying term proportional to
   (α_i − α_i^k)² + (α_j − α_j^k)² added to the objective.   (3.4)
4. Set α_B^{k+1} to the optimal solution of the sub-problem and α_N^{k+1} = α_N^k.
   Set k ← k+1 and repeat from step 2.

So B is updated in each iteration. If a_ij ≤ 0, (3.3) is a concave problem, and the convex
modification (3.4) is used instead.

Stopping criteria and WSS for C-SVC, ε-SVR, and one-class SVM

The KKT optimality condition says that a vector α is a stationary point of the general form if
and only if there are a number b and two nonnegative vectors λ and μ such that
∇f(α) + by = λ − μ,
λ_t α_t = 0, μ_t(C − α_t) = 0, λ_t ≥ 0, μ_t ≥ 0, t = 1,…,l,
where ∇f(α) := Qα + p is the gradient of f(α). This can be rewritten as
∇f(α)_t + b y_t ≥ 0 if α_t < C,
∇f(α)_t + b y_t ≤ 0 if α_t > 0.

Since y_t = ±1, defining

I_up(α) := {t | α_t < C, y_t = 1, or α_t > 0, y_t = −1} and
I_low(α) := {t | α_t < C, y_t = −1, or α_t > 0, y_t = 1},

a feasible α is a stationary point of the general form (3.1) if and only if
m(α) ≤ M(α),
where
m(α) := max_{t ∈ I_up(α)} −y_t ∇f(α)_t
and
M(α) := min_{t ∈ I_low(α)} −y_t ∇f(α)_t.

We then have the stopping condition

m(α^k) − M(α^k) ≤ ε.

For the selection of the working set B, we consider the following:

WSS 1
1. For all t, s, define
   a_ts := K_tt + K_ss − 2K_ts and b_ts := −y_t ∇f(α^k)_t + y_s ∇f(α^k)_s > 0,
   and select
   i ∈ arg max_t {−y_t ∇f(α^k)_t | t ∈ I_up(α^k)},
   j ∈ arg min_t {−b_it²/a_it | t ∈ I_low(α^k), −y_t ∇f(α^k)_t < −y_i ∇f(α^k)_i}.
2. Return B = {i, j}.

4.7 The Kernel Function
[3]

The name "kernel" comes from integral operator theory, which supplies much of the theory
relating kernels to their associated feature spaces.

SVR applies the mapping function Φ to handle non-linear samples, such as financial data sets. It
maps the input space X into a new feature space Ω = {Φ(x) | x ∈ X}, i.e. the mapping function
turns x = (x_1,…,x_N) into Φ(x) = (Φ_1(x),…, Φ_N(x)). In the feature space Ω we can then obtain a
linear regression function.

In the standard SVR QP,

min Q(α^(*)) = ½ Σ_{i=1}^N Σ_{j=1}^N (α_i − α_i*)(α_j − α_j*)<Φ(x_i), Φ(x_j)>
               + Σ_{i=1}^N (ε − y_i) α_i + Σ_{i=1}^N (ε + y_i) α_i*
subject to Σ_{i=1}^N (α_i − α_i*) = 0, α_i^(*) ∈ [0, C],

we notice that the objective function contains an inner product of the mapping function Φ(x).
The inner product lets us specify a kernel function WITHOUT considering the mapping function or
the feature space explicitly. With this advantage, we define the kernel function

K(x, z) = <Φ(x), Φ(z)>.

Since the feature vectors are never expressed explicitly, the number of inner product
computations does not necessarily scale with the number of features. The kernel makes it possible
to map data implicitly into a feature space for training while evading potential problems in
evaluating the feature map. The Gram/kernel matrix is the only information about the data used
during training.

The kernel function usually satisfies Mercer's Theorem: a symmetric positive-definite function on
a square can be represented as a sum of a convergent sequence of product functions (James Mercer
(1909), "Functions of positive and negative type and their connection with the theory of
integral equations", Philos. Trans. Roy. Soc. London 209).

There are 4 basic types of kernel functions [4]:

1. Linear: K(x_i, x_j) = x_i^T x_j
2. Polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
3. Radial Basis Function (RBF): K(x_i, x_j) = exp(−γ ||x_i − x_j||²), γ > 0
4. Sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)

Here γ, r and d are kernel parameters.

The RBF kernel is usually a reasonable first choice. It maps samples non-linearly into a higher
dimensional space, so it is likely to handle financial (non-linear) data better than the linear
kernel. At the same time, it is not as complex as the polynomial kernel and avoids the validity
issues of the sigmoid type.

Note: the linear and sigmoid kernels have been shown to behave like the RBF kernel for certain
parameters (Keerthi and Lin 2003, Lin and Lin 2003).

Also, the polynomial kernel may become too complex, as it has more hyper-parameters than the RBF
kernel. The RBF kernel has fewer numerical difficulties: 0 < K_ij ≤ 1 for RBF, whereas for the
polynomial kernel, as the degree increases, values may go to infinity (γ x_i^T x_j + r > 1) or to
zero (γ x_i^T x_j + r < 1).

Lastly, the sigmoid kernel is not valid in some circumstances.

If the number of features is very large, we may have to use the linear kernel.
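Since the kernel replaces the explicit feature map, training only ever touches the Gram matrix K_ij = K(x_i, x_j). A short sketch (NumPy/scikit-learn) of the RBF Gram matrix computed both by hand and with a library call:

    # RBF Gram matrix: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), no feature map needed.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    X = np.array([[0.0, 1.0],
                  [1.0, 1.0],
                  [2.0, 0.0]])
    gamma = 0.5

    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K_manual = np.exp(-gamma * sq_dists)
    K_library = rbf_kernel(X, gamma=gamma)

    print(np.allclose(K_manual, K_library))   # True
    print(K_manual.round(3))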

4.8 Cross Validation and Grid-search
The aim here is to find the values of (C, γ) that give the best predictions [4].

In v-fold cross-validation, we first divide the training set into v subsets of equal size. Each
subset is then tested using the model trained on the other (v − 1) subsets. Each instance of the
whole training set is therefore predicted once, and the cross-validation accuracy is the
percentage of data that are correctly predicted.

This process helps prevent the over-fitting problem.

Grid-search: pairs of (C, γ) are tried and the pair with the best cross-validation performance is
selected.
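A sketch of the (C, γ) grid-search with v-fold cross-validation (scikit-learn; the grid values and toy data are illustrative, not those used in the study):

    # Grid-search over (C, gamma) for an RBF epsilon-SVR with 5-fold cross-validation.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 6))                              # 6 predictive inputs
    y = X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=300)   # synthetic response

    param_grid = {'C': [2.0**k for k in range(-2, 6)],
                  'gamma': [2.0**k for k in range(-6, 2)]}
    search = GridSearchCV(SVR(kernel='rbf', epsilon=0.1), param_grid,
                          cv=5, scoring='neg_mean_squared_error')
    search.fit(X, y)
    print(search.best_params_)   # the (C, gamma) pair with the best cross-validation score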

Chapter V. Empirical Analysis and Findings
5.1 About Financial Time Series
The financial markets do not move arbitrarily at random. Just as natural resources are limited,
so is the liquidity of financial instruments on the exchanges [2]. Supply and demand drive price
moves, i.e. basic economics. To study prices of the immediate future, then, one must investigate
the forces affecting market supply or demand.

Many researchers use the historical response variable (the predicted output) as the exclusive
input, implying that financial time series are dependent and stationary in nature. This is
fundamentally flawed, as the institutions (hedge funds, investment banks, George Soros, etc.)
that execute price-moving trades largely make decisions with respect to economic conditions.

Financial time series do contain quite a bit of noise, making precise value forecasts
challenging. Short-term intra-day price change predictions require deep analysis of market depth
(bid/ask volume changes at market-maker and specialist levels, etc.) [1].
Notwithstanding the foregoing, empirical evidence suggests noteworthy relations between financial
instruments and economic factors over longer time frames.

5.2 Analysis Set-up
This study aims to forecast the ASX200 stock index value (range) and directional bias. The
predictions result from interpretations of several independent, significantly correlated
financial time series.

Historically, equity (and commodity, real estate) markets often respond to inter-market
interactions involving factors such as interest rates, political events, and traders'
expectations. Inflationary, liquidity, and sentiment measures independent of the ASX200 (the
response variable) therefore make viable predictive inputs. This study thus commences on a
fundamentally sound structure.

5.2.1 General idea
The training data set (X_i, Y_{i+1}) holds daily closing prices and quantities derived from them.
Y_{i+1}: the set of output/response variables applied for training at day (i+1)
X_i: the set of input variables matched at day (i)
i = 1,…,N
N = 1,707
The initial training set (X_i, Y_{i+1}) starts on 31/10/01 and ends at the week ending March
2008.

The training step lets the SVR create a linear model in a high-dimensional space matching each
day's input set with the output values of the next trading day.

Once training completes, we can apply the resulting SVR regression model to testing data X_t to
make next-day forecasts of Y_{t+1}.
 
5.2.2 Training Output Selection (Y_{i+1})
Yahoo Finance serves as the (free) data provider. The following are the 1-day-forward response
variables:
y_{i+1}: (ticker symbol: ^AXJO) the ASX200 closing price for day (i+1);
y_r: the ASX200 1-day return for day (i+1), y_r = y_{i+1}/y_i − 1, expressed as a percentage.

5.2.3 Training Input Selection (X_i)
While numerous economic factors contribute to effective valuation of global equity indexes
(including the ASX200), this paper focuses on the mathematical potential of SVR; the predictive
variables therefore stay relatively minimal, although the few selected hold significant
correlations to global equity markets.

As globalization advances, global financial markets have developed highly positive correlations
with one another. It therefore makes sense to utilize American financial quantities for
availability: American stock/derivative exchanges (NYSE, ISE, CBOE) offer a vast amount of free
information; the ASX does not.

The predictive time series are acquired from ratesfx.com and Yahoo Finance and listed below:

1. AUD: the value of 1 AUD (Australian Dollar) in USD (US Dollar). The exchange rate plays an
important role, as arbitrageurs, particularly institutional program-trading machines, exploit
price discrepancies between international exchanges. Program trading deals with large volumes,
often enough to have a significant price-moving impact. [7]

2. VIX: (ticker symbol: ^VIX) the CBOE (Chicago Board Options Exchange) Volatility Index. This
index reflects the average implied volatility of near-the-money S&P500 index options. As the
S&P500 remains the most prevalently referenced American stock index, the VIX reflects "expected"
market volatility of the immediate future and thereby the sentiment of American traders.

3. GOX: (ticker symbol: ^GOX) the CBOE Gold Index. Throughout the ages, gold has served as an
instrument of fixed intrinsic value, i.e. with insignificant depreciation or real added value.
Gold price therefore presents a near-perfect indication of the real rate of inflation.

In addition, the following derived quantities are used as inputs:

4. V_1: 1-day historical change in the VIX, (VIX_i/VIX_{i−1} − 1)
5. V_5: 5-day historical change in the VIX, (VIX_i/VIX_{i−5} − 1)
6. G_1: 1-day realized return in the GOX, (GOX_i/GOX_{i−1} − 1)
7. G_5: 5-day realized return in the GOX, (GOX_i/GOX_{i−5} − 1)

The derived changes in VIX and GOX should reflect investor sentiment and inflationary concerns
respectively. As the speed (derivative) of volatility and the rate of inflation appear stationary
when historically referenced, these measures should contribute as significant predictive
variables.
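A sketch (pandas) of how such derived inputs and the aligned next-day response could be built from a table of daily closes; the column names here are hypothetical, not those of the study's raw files:

    # Build V1, V5, G1, G5 and the 1-day-ahead targets from daily closing prices.
    import pandas as pd

    def build_features(df: pd.DataFrame) -> pd.DataFrame:
        # df: daily closes indexed by date, with columns 'axjo', 'aud', 'vix', 'gox'
        out = pd.DataFrame(index=df.index)
        out['AUD'] = df['aud']
        out['V1'] = df['vix'].pct_change(1)    # VIX_i / VIX_{i-1} - 1
        out['V5'] = df['vix'].pct_change(5)    # VIX_i / VIX_{i-5} - 1
        out['G1'] = df['gox'].pct_change(1)    # GOX_i / GOX_{i-1} - 1
        out['G5'] = df['gox'].pct_change(5)    # GOX_i / GOX_{i-5} - 1
        # next-day targets aligned to the inputs of day i
        out['y_next'] = df['axjo'].shift(-1)               # y_{i+1}
        out['y_r'] = df['axjo'].pct_change(1).shift(-1)    # y_{i+1}/y_i - 1
        return out.dropna()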

5.2.4 Testing Variables
X_t: all of the training input variables on day (t).
Z_{t+1}: the forecast set, consisting of the ASX200 index forecast (z_{t+1}) and the 1-day return
forecast (z_r), at (t+1).

Test data for z_{t+1} range from 3/9/08 to the week ending 9/4/09.
Test data for z_r range from 10/12/08 to the week ending 9/4/09.

5.2.5 Error Analysis 
ε_c: test error of the closing-price forecast z_(t+1), defined as ε_c = y_(t+1)/z_(t+1) - 1.
ε_r: test error of the return forecast z_r, defined as ε_r = y_r(t)/z_r(t) - 1.
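
A trivial Python helper for these two error measures, written purely for illustration (array-like inputs assumed):

import numpy as np

def forecast_errors(y_next, z_next, y_ret, z_ret):
    """Relative test errors eps_c and eps_r as defined above."""
    eps_c = np.asarray(y_next) / np.asarray(z_next) - 1  # closing-price valuation error
    eps_r = np.asarray(y_ret) / np.asarray(z_ret) - 1    # 1-day return forecast error
    return eps_c, eps_r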
 
5.2.6 SVR Adaptive Training Strategy 
Since financial time series are not stationary, distant extrapolations usually perform inconsistently and unreliably. To remedy this, for every y_(t+1) the training data are adjusted to end at day (t), so that the model adapts with each sequential test.

Example,
Training set (X_1, ..., X_100, Y_2, ..., Y_101) leads to test variables (X_101, Z_102); the next adapted training set (X_2, ..., X_101, Y_3, ..., Y_102) then leads to test variables (X_102, Z_103).

This slight tweak of the learning process resulted in significantly more contained Z_(t+1) test errors. It lets the SVR continuously adjust to and learn from the daily changes in X_i, and the extrapolation always remains one time step (one business day) ahead, thereby reducing nonstationarity-related testing error. A minimal sketch of this rolling retraining is given below.
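
The sketch below illustrates the rolling, one-step-ahead retraining in Python with scikit-learn's epsilon-SVR. It is a stand-in under stated assumptions, not the Matlab/LS-SVM procedure actually used here: the window length, kernel, and C/epsilon values are placeholders rather than the thesis settings.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def rolling_one_step_forecasts(X, y, window=100):
    """Retrain on the `window` days ending at day t, then forecast day t+1.

    X is an (n, d) array of daily inputs; y[i] is the response for day i+1
    given X[i].  Hyperparameters below are placeholders only.
    """
    forecasts = []
    for t in range(window, len(X)):
        model = make_pipeline(StandardScaler(),
                              SVR(kernel="rbf", C=10.0, epsilon=0.01))
        model.fit(X[t - window:t], y[t - window:t])      # training ends at day t
        forecasts.append(model.predict(X[t:t + 1])[0])   # forecast for day t+1
    return np.array(forecasts)

With window = 100, the indexing reproduces the example above: the first model is trained on (X_1, ..., X_100, Y_2, ..., Y_101) and then forecasts Z_102 from X_101.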
 
5.2.7 Applied Software  
Many current SVM researchers have created software toolsets for Matlab. LS-SVM (Least Squares SVM) proved the most user-friendly and was therefore applied in this study. The adaptive training modification, however, requires some extra legwork, since the toolbox does not carry this capability by default.

For test-error and related analysis, Minitab does the job adequately. Excel handles the raw data.

5.3 Empirical Findings 
Below is some statistical analysis of the SVR forecast results and errors.

ASX200 actual performance vs. ε_c and ε_r

With ε_c tracking the actual ASX200 at such a similar pace, the index price moves could potentially be exploited if ε_c shows a likelihood of remaining stationary.

While ε_r appears random, it also appears fairly contained.



5.3.1 Cross Correlation Analysis 
Time series 1: y_t
Time series 2: z_t

A few points of interest,


Correlation between (y, z): 38.56% at Lag = 0.
Correlation between (y, z): 54.83% at Lag = 29.

With the correlation significantly positive as the lag increases, peaking at roughly 29 time steps, the ASX200 index seems to "follow" the SVR-derived extrapolation value. A negative correlation at negative lags would suggest the same conclusion. An analysis of the error terms could support this idea; a sketch for computing such lagged correlations is given below.
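
Purely as an illustration (this is not the Minitab cross-correlation procedure used for the figures above), the lagged correlations could be computed along the following lines in Python:

import numpy as np

def lagged_correlations(y, z, max_lag=30):
    """Sample correlation corr(y_(t+lag), z_t) for lag = 0..max_lag.

    A markedly positive value at a positive lag indicates that y "follows"
    the forecast series z, as discussed above.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    corr = {}
    for lag in range(max_lag + 1):
        if lag == 0:
            corr[lag] = np.corrcoef(y, z)[0, 1]
        else:
            corr[lag] = np.corrcoef(y[lag:], z[:-lag])[0, 1]
    return corr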

Time series 1: y_r
Time series 2: z_r

Correlation between (y_r, z_r): 9.7% at Lag = 0.

It appears that an edge, while not large, still exists for the 1-day return forecast. The apparently spurious correlation values as the absolute lag increases suggest that z_r should only be applied to its associated day.
 
5.3.2 Normality Tests 
These tests give a feel for the error distributions, since the subsequent analysis depends on their conclusions.

(ε_c): Closing Price Valuation Error Normality Test



The distribution of the ASX200 closing-price valuation error (ε_c) does not appear normal; a nonparametric analysis is therefore required.

(ε_r): 1-day Return Forecast Error Normality Test

Interestingly, the 1-day return forecast error does appear to resemble a normal distribution, so a parametric approach is used below; a minimal normality-check sketch follows.
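
For completeness, a minimal normality check in Python. The thesis relied on Minitab's normality test; the Shapiro-Wilk test below is merely a comparable stand-in, not the procedure used for the plots above.

from scipy import stats

def looks_normal(errors, alpha=0.05):
    """Shapiro-Wilk check: True if there is no evidence against normality."""
    _, p_value = stats.shapiro(errors)
    return p_value > alpha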

5.3.3 Error Distributions 
Let us look at the distributions and perhaps get a feel for their behaviour.
(ε_c)

Applying the nonparametric method, Wilcoxon Signed Rank CI: Ec

         N   N*   Estimated Median   Achieved Confidence   95% CI Lower   95% CI Upper
Ec     152    2            -0.1492                  95.0        -0.1685        -0.1223

The error appears largely negative, as there has been so much panic-motivated selling since late 2007. Interestingly, (ε_c) did not present any considerable outliers despite the jumps in market volatility throughout the test period.
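
A rough Python stand-in for the nonparametric summary above: the confidence interval in the table came from Minitab, whereas the sketch below pairs the Wilcoxon signed-rank test with a Hodges-Lehmann pseudo-median as a comparable point estimate.

import numpy as np
from scipy import stats

def signed_rank_summary(eps_c):
    """Wilcoxon signed-rank test of zero median plus a Hodges-Lehmann
    pseudo-median (median of all pairwise Walsh averages)."""
    eps_c = np.asarray(eps_c, dtype=float)
    stat, p_value = stats.wilcoxon(eps_c)
    i, j = np.triu_indices(len(eps_c))               # pairs with i <= j
    pseudo_median = np.median((eps_c[i] + eps_c[j]) / 2.0)
    return pseudo_median, stat, p_value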

(ε_r)

Applying the parametric method, Descriptive Statistics: Er

Variable    N   N*      Mean   SE Mean     StDev    Minimum         Q1    Median
Er         83   71   0.00061   0.00172   0.01565   -0.04245   -0.00951   0.00327

Variable        Q3   Maximum   Skewness   Kurtosis
Er         0.01076   0.03427      -0.41       0.18

In spite of the relatively small sample, the negative skew and positive kurtosis agree with actual equity market behaviour. The negative skew reflects the tendency of stock prices to fall harder than they rise, partly as a result of the credit risk inherent in each exchange-listed entity. The positive kurtosis coincides with the way the values of financial instruments tend to move sporadically, since volatility and traders' sentiment do not stand still (as observed empirically).
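
The moment-based summary is easy to reproduce; the Python/scipy sketch below (illustrative only) returns the same kind of quantities as the Minitab table above, with kurtosis reported as excess kurtosis (zero for a normal distribution).

import numpy as np
from scipy import stats

def return_error_moments(eps_r):
    """Mean, standard deviation, skewness and excess kurtosis of eps_r."""
    eps_r = np.asarray(eps_r, dtype=float)
    return {
        "mean": float(np.mean(eps_r)),
        "stdev": float(np.std(eps_r, ddof=1)),
        "skewness": float(stats.skew(eps_r)),
        "kurtosis": float(stats.kurtosis(eps_r)),  # excess (Fisher) kurtosis
    }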

 

5.3.4 The Gold Connection 
On a coincidental note, the AXJO shows a considerable positive correlation with the GOX (see the graph below; y = correlation, x = i), over the period from Oct. 01 to Feb. 09.

[Figure: GOX, AXJO Correlation; y-axis: correlation, -1.00 to 1.50; x-axis: i, 0 to 2000]

The correlation suggests that the ASX200 index largely adjusts with the real rate of inflation, and hence that no significant real growth in value has occurred for it over roughly the past decade. It also implies that gold remains a significant factor in the pricing of equity indexes.

5.4 Test Conclusions 
SRM (Structural Risk Minimization) has shone throughout the experiment. Despite the recent shocks in the global financial economy, the SVR-derived quantities maintained a fairly stable bound on errors, which makes them practical for professional traders.

Having an accurate forecast of price ranges definitely helps: it makes it possible to literally "buy low, sell high". Though not quite the Holy Grail, SVR offers a promising means to exploit economic inefficiencies, perhaps opening doors to a new frontier.

References

[1] Almgren, R., Thum, C., Hauptmann, E., Li, H. (2006). Equity market impact (Quantitative
trading, 2006/14). University of Toronto, Dept. of Mathematics and Computer
Science; Citigroup Global Quantitative Research, New York.

[2] Boucher, M. (1999). The Hedge Fund Edge. New York: John Wiley & Sons, Inc.

[3] Cao, L., Tay, F. (2001) Financial Forecasting Using Support Vector Machines (Neural
Comput & Applic (2001)10:184-192). Dept. of Mechanical and Production Engineering,
National University of Singapore, Singapore.

[4] Chang, C., Lin, C. (2009). LIBSVM: a Library for Support Vector Machines. Dept. of
Computer Science, National Taiwan University, Taipei.

[5] Chen, P., Fan, R., Lin, C. & Joachims, T. (2005) . Working Set Selection Using Second
Order Information for Training Support Vector Machines (Journal of Machine Learning
Research 6 (2005) 1889-1918). Department of Computer Science, National Taiwan
University, Taipei.

[6] Claessen, H., Mittnik, S. (2002). Forecasting Stock Market Volatility and the Informational
Efficiency of the DAX-index Options Market. Johann Wolfgang Goethe-University,
Frankfurt.

[7] Dubil, R. (2004). An Arbitrage Guide To The Financial Markets. West Sussex, England:
John Wiley & Sons, Ltd.

[8] Glasmachers, T., Igel, C. (2006). Maximum-Gain Working Set Selection for SVMs (Journal
of Machine Learning Research 7 (2006)1437-1466). Ruhr-University, Bochum Germany.

[9] Huang, W., Nakamori, Y. & Wang, S. (2004). Forecasting stock market direction with
support vector machine (Computers & Operations Research 32 (2005) 2513-2522) School of
Knowledge Science, Japan Advanced Institute of Science and Technology; Institute of
Systems Science, Academy of Mathematics and System Sciences, Chinese Academy of
Sciences, Beijing.

[10] Jensen, P., Bard, J. (2008). Operations Research Models and Methods. New York: John
Wiley & Sons, Inc.

[11] Joachims, T. (1998). Making Large-Scale SVM Learning Practical. Cambridge, USA: MIT
Press.

[12] Kecman, V. (2001). Learning from data, Support vector machines and neural networks.
Cambridge, USA: MIT Press.

[13] Kumar, M., Thenmozhi, M. (2005). Forecasting Stock Index Movement: A Comparison of
Support Vector Machines And Random Forest. Dept. of Management Studies, Indian
Institute of Technology, Chennai.

[14] Li, B., Hu, J. & Hirasawa, K. (2008). Financial Time Series Prediction Using a Support
Vector Regression Network. Graduate School of Information, Waseda University, Japan.

[15] Nocedal, J., Wright, S. (1999). Numerical Optimization. New York: Springer-Verlag.

[16] Smola, A., Scholkopf, B. (2003). A Tutorial on Support Vector Regression (NeuroCOLT
Technical Report TR-98-030). RSISE, Australian National University, Canberra, Australia &
Max-Planck-Institut fur biologische Kybernetik, Tubingen, Germany.

[17] Yang, H. (2003). Margin Variations in Support Vector Regression for the Stock Market
Prediction. Dept. of Computer Science & Engineering, The Chinese University of Hong
Kong, Hong Kong.
