Master thesis
Momentum Strategies:
From novel Estimation Techniques to
Financial Applications
Author: Tung-Lam Dao
Supervisor: Prof. Thierry Roncalli
Acknowledgments
Confidential notice
Introduction
2.4.1 Microstructure effect
2.4.2 Two-time-scale volatility estimator
2.4.3 Numerical implementation and backtesting
2.5 Conclusion
Conclusions
C Appendix of Chapter 3
C.1 Dual problem of SVM
C.1.1 Hard-margin SVM classifier
C.1.2 Soft-margin SVM classifier
C.1.3 ε-SV regression
C.2 Newton optimization for the primal problem
C.2.1 Quadratic loss function
C.2.2 Soft-margin SVM
List of Figures

2.21 Likelihood function of high-low estimators versus filtered parameter β
2.22 Likelihood function of high-low estimators versus effective moving window
2.23 IGARCH estimator versus moving-average estimator for close-to-close prices
2.24 Comparison between different IGARCH estimators for high-low prices
2.25 Daily estimation of the likelihood function for various close-to-close estimators
2.26 Daily estimation of the likelihood function for various high-low estimators
2.27 Backtest for close-to-close and realized estimators
2.28 Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
2.29 Two-time-scale estimator of intraday volatility

List of Tables
Acknowledgments
During six unforgettable months in the R&D team of Lyxor Asset Management, I experienced and enjoyed every moment. Apart from all the professional experience I gained from everyone in the department, I really appreciated the great atmosphere in the team, which motivated me every day.

I would like first to thank Thierry Roncalli for his supervision during my stay in the team. I could never have imagined learning so many interesting things during my internship without his direction and his confidence. Thierry introduced me to the financial concepts of the asset management world in a very interactive way: I would say that I learnt finance in every single discussion with him. He taught me how to combine learning and practice. On the professional side, Thierry helped me fill the gaps in my financial knowledge by allowing me to work on various interesting topics, and he made me confident in presenting my understanding of this field. In daily life, Thierry shared his own experiences and also taught me how to adapt to this new world.

I would like to thank Nicolas Gaussel for his warm reception in the Quantitative Management department, for his confidence and for his encouragement during my stay at Lyxor. I had the chance to work with him on a very interesting topic concerning the CTA strategy, which plays an important role in asset management. I would like to thank Benjamin Bruder, my nearest neighbor, for his guidance and his supervision throughout my internship. Informally, Benjamin was almost my co-advisor. I must say that I owe him a lot for all of his patience in our daily discussions, teaching me and working out the many questions arising in my projects. I am really grateful for his sense of humor, which warmed up the atmosphere.

To all members of the R&D team, I would like to express my gratitude for their help, their advice and everything they shared with me during my stay. I am really happy to have been one of them. Thanks to Jean-Charles for his friendship, for all the daily discussions and for his support of every initiative in my projects. A great thanks to Stephane, who always cheered up our breaks with his intelligent humor; I would say that I learnt from him the most interesting view of the "Binomial world". Thanks to Karl for his explanations of his macro world. Thanks to Pierre for all his help on data collection and his passion in every explanation, such as the story of "Merrill Lynch's investment clock". Thanks to Zelia for a very stimulating collaboration on my last project and the great time during our internships.

To everyone on the other side of the room, I would like to thank Philippe Balthazard for his comments on my projects and his point of view on financial aspects. Thanks to Hoang-Phong Nguyen for his help on databases and his support during my stay. There are many other people I had the chance to interact with whom I cannot cite here.
Confidential notice
This thesis is subject to confidential research in the R&D team of Lyxor Asset Management. It is divided into two main parts. The first part, comprising the first three chapters (1, 2 and 3), consists of applications of novel estimation techniques for the trend and the volatility of financial time series. We present the main results in detail, together with a publication in the Lyxor White Paper series. The second part, concerning the analysis of CTA performance in the risk-return framework (see The Lyxor White Paper Series, Issue #7, June 2011), is skipped due to confidentiality. Only a brief introduction and the final conclusion of this part (Chapter 4) are presented in order to sketch out its main features.
Introduction

During the internship in the Research and Development team of Lyxor Asset Management, we studied novel techniques applicable to asset management. We focused on the analysis of some special classes of momentum strategies, such as trend-following strategies and vol-target strategies. These strategies play a crucial role in quantitative management, as they aim to optimize returns based on exploitable signals of market inefficiency and to limit market risk via an efficient control of the volatility.

The objectives of this report are two-fold. We first studied some novel techniques from statistics and signal processing, such as trend filtering, daily and high-frequency volatility estimators and support vector machines. We employed these techniques to extract interesting financial signals, which are used to implement the momentum strategies described in detail in every chapter of this report. The second objective concerns the study of the performance of these strategies based on the general risk-return analysis framework (see B. Bruder and N. Gaussel, 7th Lyxor White Paper). This report is organized as follows:
We next review in the second chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data; then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices are perturbed by an additional noise, the so-called microstructure noise. This effect comes from the bid-ask spread and the short time scale: within a short time interval, the trading price does not exactly reflect the equilibrium price determined by supply and demand, but bounces between the bid and ask prices. In the second part, we discuss the effect of the microstructure noise on volatility estimation; it is a very important topic, concerning the large field of "high-frequency" trading. Examples of backtesting on indices and stocks will illustrate the efficiency of the considered techniques.
Chapter 1
1.1 Introduction
Trend detection is a major task of time series analysis from both mathematical and financial points of view. The trend of a time series is considered as the component containing the global change, in contrast to the local change due to the noise. The procedure of trend filtering concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. That explains why mathematical approaches to trend extraction have a long history and why this subject still attracts great interest in the scientific community1. From an investment perspective, trend filtering is at the core of most momentum strategies developed in the asset management industry and the hedge fund community to improve performance and limit portfolio risk.
1. For a general review, see Alexandrov et al. (2008).
Trading Strategies with L1 Filtering
1.2 Motivations
In economics, the trend-cycle decomposition plays an important role to describe
a non-stationary time series into permanent and transitory stochastic components.
Generally, the permanent component is assimilated to a trend whereas the transitory
component may be a noise or a stochastic cycle. Moreover, the literature on business
cycle has produced a large number of empirical research on this topic (see for example
Cleveland and Tiao (1976), Beveridge and Nelson (1991), Harvey (1991) or Hodrick
and Prescott (1997)). These last authors have then introduced a new method to
estimate the trend of long-run GDP. The method widely used by economists is based
on L2 filtering. Recently, Kim et al. (2009) have developed a similar filter by
replacing the L2 penalty function by a L1 penalty function.
The observed series is modeled as the sum of a trend and a noise:

    y_t = x_t + ε_t

Let us first recall the well-known L2 filter (the so-called Hodrick-Prescott filter). This scheme consists in determining the trend x_t by minimizing the following objective function:

    (1/2) Σ_{t=1}^{n} (y_t − x_t)² + λ Σ_{t=2}^{n−1} (x_{t−1} − 2x_t + x_{t+1})²
with λ > 0 the regularization parameter, which controls the trade-off between the smoothness of x_t and the size of the residual y_t − x_t (i.e. the noise ε_t). We remark that the second term is the discrete second derivative of the trend x_t, which characterizes the smoothness of the curve. Minimizing this objective function gives a solution which is the trade-off between fidelity to the data and the smoothness of its curvature. In finance, this scheme does not give a clear signature of the market tendency. By contrast, if we replace the L2 norm by the L1 norm in the objective function, we obtain more interesting properties. Therefore, Kim et al. (2009) propose to consider the following objective function:
    (1/2) Σ_{t=1}^{n} (y_t − x_t)² + λ Σ_{t=2}^{n−1} |x_{t−1} − 2x_t + x_{t+1}|
This problem is closely related to the Lasso regression of Tibshirani (1996) and to the L1-regularized least squares problem of Daubechies et al. (2004). Here, taking the L1 norm imposes that the second derivative of the filtered signal be zero except at a small number of points. Hence, the filtered signal is composed of a set of straight trends and breaks2. The competition between the two terms in the objective function becomes a competition between the number of straight trends (or number of breaks) and the closeness to the raw data. Therefore, the smoothing parameter λ plays an important role in detecting the number of breaks. In what follows, we present briefly how the L1 filter works for trend detection, together with its extension to mean-reverting processes. The calibration procedure for the λ parameter will also be discussed in detail.
In vector form, the objective function of the L2 filter is:

    (1/2) ‖y − x‖₂² + λ ‖Dx‖₂²

where y = (y_1, ..., y_n), x = (x_1, ..., x_n) ∈ Rⁿ and the D operator is the (n − 2) × n matrix:

    D = [ 1 −2  1
             1 −2  1
                ⋱
                  1 −2  1 ]    (1.1)
The exact solution of this estimation problem is given by:

    x⋆ = (I + 2λDᵀD)⁻¹ y
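As a quick illustration, the closed form above can be computed directly with numpy; this is a minimal sketch (the function name hp_filter and the dense solve are ours; in practice a banded solver would be preferred):

```python
import numpy as np

def hp_filter(y, lam):
    """L2 (Hodrick-Prescott) trend: solve (I + 2*lam*D'D) x = y, where D is
    the (n-2) x n second-difference operator of equation (1.1)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return np.linalg.solve(np.eye(n) + 2.0 * lam * D.T @ D, y)

# An affine series lies in the kernel of D, so the filter returns it
# unchanged for any lambda: an easy sanity check of the closed form.
t = np.arange(100, dtype=float)
y = 0.5 * t + 3.0
x_hat = hp_filter(y, lam=1600.0)
```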
The idea of the L2 filter can be generalized to a larger class, the so-called Lp filter, by using an Lp penalty instead of the L2 penalty. This generalization is already discussed in the work of Daubechies et al. (2004) for the linear inverse problem and in the Lasso regression problem of Tibshirani (1996). If we consider an L1 filter, the objective function becomes:

    (1/2) Σ_{t=1}^{n} (y_t − x_t)² + λ Σ_{t=2}^{n−1} |x_{t−1} − 2x_t + x_{t+1}|
2. A break is a position where the trend of the signal changes.
The first model consists of data simulated as a set of straight trend lines with a white-noise perturbation:

    y_t = x_t + ε_t,   ε_t ∼ N(0, σ²)
    x_t = x_{t−1} + v_t                                   (1.2)
    Pr{v_t = v_{t−1}} = p
    Pr{v_t = b (U_[0,1] − 1/2)} = 1 − p
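Model (1.2) is easy to simulate; a minimal sketch (the function name is ours, and we reuse the parameter values p = 0.998, b = 50 and σ = 8 given later in the footnotes):

```python
import numpy as np

def simulate_trend_model(n, p, b, sigma, seed=0):
    """Model (1.2): slope v_t kept with probability p, otherwise redrawn as
    b * (U[0,1] - 1/2); trend x_t = x_{t-1} + v_t observed with Gaussian noise."""
    rng = np.random.default_rng(seed)
    v = np.empty(n)
    v[0] = b * (rng.uniform() - 0.5)
    for t in range(1, n):
        v[t] = v[t - 1] if rng.uniform() < p else b * (rng.uniform() - 0.5)
    x = np.cumsum(v)                     # x_t = x_{t-1} + v_t
    y = x + rng.normal(0.0, sigma, n)    # y_t = x_t + eps_t
    return y, x

y, x = simulate_trend_model(n=2000, p=0.998, b=50.0, sigma=8.0)
```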
By applying the penalty condition to the first derivative, we can expect to obtain a fitted signal with zero slope almost everywhere; the cost of this penalty is proportional to the number of jumps. In this case, we would like to minimize the following objective function:

    (1/2) Σ_{t=1}^{n} (y_t − x_t)² + λ Σ_{t=2}^{n} |x_t − x_{t−1}|
or in the vectorial form:
    (1/2) ‖y − x‖₂² + λ ‖Dx‖₁
Here the D operator is the (n − 1) × n matrix which is the discrete version of the first-order derivative:

    D = [ −1  1
              −1  1
                  ⋱
                   −1  1 ]    (1.4)
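Both difference operators (1.1) and (1.4) can be built with a small helper (name ours); constant signals lie in the kernel of the first-order operator, and affine signals in the kernel of the second-order one:

```python
import numpy as np

def diff_operator(n, order):
    """order=1: the (n-1) x n first-difference matrix (1.4);
    order=2: the (n-2) x n second-difference matrix (1.1)."""
    stencil = [-1.0, 1.0] if order == 1 else [1.0, -2.0, 1.0]
    D = np.zeros((n - order, n))
    for i in range(n - order):
        D[i, i:i + order + 1] = stencil
    return D

n = 50
t = np.arange(n, dtype=float)
D1, D2 = diff_operator(n, 1), diff_operator(n, 2)
r1 = D1 @ np.ones(n)        # constants are in the kernel of D1
r2 = D2 @ (2.0 * t + 1.0)   # affine signals are in the kernel of D2
```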
We may apply the same minimization algorithm as previously (see Appendix A.1.1). To illustrate this, we consider a model with step trend lines perturbed by a white noise process:

    y_t = x_t + ε_t,   ε_t ∼ N(0, σ²)
    Pr{x_t = x_{t−1}} = p                                 (1.5)
    Pr{x_t = b (U_[0,1] − 1/2)} = 1 − p
We employ this model to test the L1 − C filtering against an HP filtering adapted to the first derivative6, which corresponds to the following optimization program:

    min (1/2) Σ_{t=1}^{n} (y_t − x_t)² + λ Σ_{t=2}^{n} (x_t − x_{t−1})²
In Figure 1.3, we report the corresponding results7. For the second test, we consider a mean-reverting process (Ornstein-Uhlenbeck process) whose mean value follows a regime-switching process:

    y_t = y_{t−1} + θ (x_t − y_{t−1}) + ε_t,   ε_t ∼ N(0, σ²)
    Pr{x_t = x_{t−1}} = p                                 (1.6)
    Pr{x_t = b (U_[0,1] − 1/2)} = 1 − p

Here, x_t is the process which characterizes the mean value and θ is inversely proportional to the time of return to the mean. In Figure 1.4, we show how the L1 − C filter captures the original signal in comparison with the HP filter8.
6. We use the term HP filter in order to keep notations homogeneous. However, we note that this filter is indeed the FLS filter proposed by Kalaba and Tesfatsion (1989) when the exogenous regressors reduce to a constant.
7. The parameters are p = 0.998, b = 50 and σ = 8.
8. For the simulation of the Ornstein-Uhlenbeck process, we have chosen p = 0.9985, b = 20, θ = 0.1 and σ = 2.
[Figure panels: L1 − C filter (left) versus HP filter (right) applied to the simulated series, plotted against t]
In Figures 1.5 and 1.6, we test the efficiency of the mixing scheme on the straight
trend lines model (1.2) and the random walk model (1.3)9 .
For large values of λ, we obtain long-term trends of the data, while for small values of λ, we obtain short-term trends. In this paragraph, we attempt to define a procedure which permits the right choice of smoothing parameter according to the desired trend extraction.
A preliminary remark

For small values of λ, we recover the original form of the signal. For large values of λ, we remark that there exists a maximum value λmax above which the trend signal has the affine form x_t = α + βt, where α and β are two constants which do not depend on the time t. The value of λmax is given by:
    λmax = ‖(DDᵀ)⁻¹ D y‖_∞
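This quantity is cheap to compute directly; a sketch (helper name ours), which also checks the limiting case Dy = 0 for an affine input:

```python
import numpy as np

def lambda_max(y, order=2):
    """lambda_max = ||(D D')^{-1} D y||_inf for the difference operator D;
    above this value the (order=2) trend estimate is affine."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    stencil = [-1.0, 1.0] if order == 1 else [1.0, -2.0, 1.0]
    D = np.zeros((n - order, n))
    for i in range(n - order):
        D[i, i:i + order + 1] = stencil
    return np.max(np.abs(np.linalg.solve(D @ D.T, D @ y)))

t = np.arange(200, dtype=float)
lam_affine = lambda_max(3.0 * t + 2.0)   # D y = 0 for an affine input, so this is 0
lam_wave = lambda_max(np.sin(t / 10.0))  # strictly positive for a curved input
```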
We can use this remark to get an idea of the order of magnitude of λ which should be used to determine the trend over a certain time period T. To see this, we take the data over the total period T. If we want the global trend over this period, we fix λ = λmax; this λ gives a unique trend for the signal over the whole period. If one needs more detail on the trend over shorter periods, we can divide the signal into p time intervals and then estimate λ
main window. This parameter characterizes the time horizon of the investment strategy. Figure 3.7 shows how the data set is divided into different windows in the cross-validation procedure. In order to get the optimal parameter λ, we compute the total error after scanning the whole data set with the window T1. The algorithm of this calibration process is described as follows:
Figure 1.10 illustrates the calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50 (the number of observations is equal to 1,008 trading days). With m = p = 12 and n = 15, the estimated optimal value λ⋆ for the L1 − T filter is equal to 7.03.
We have observed that this calibration procedure is more favorable for long-term time horizons, that is for estimating a global trend. For short-term time horizons, the prediction of local trends is much more perturbed by the noise. We have computed the probability of a good prediction of the market tendency for long-term and short-term time horizons. This probability is about 70% for a 3-month time horizon, while it is just 50% for a one-week time horizon. It follows that even if the fit is good in-sample, the noise is large, meaning that the predicted future tendency is right only half the time: 1/2 for an increasing market and 1/2 for a decreasing market. In order to obtain better results for smaller time horizons, we improve the last algorithm by proposing a two-trend model. The first trend is the local one, determined by the first algorithm with the parameter T2 corresponding to the local prediction. The second trend is the global one, which gives the tendency of the market over a longer period T3. The choice of this global-trend parameter is very similar to the choice of the moving-average parameter. This model can be considered as a simple version of a mean-reverting model for the trend. In Figure 1.11, we describe how the data set is divided for estimating the local trend and the global trend.
The procedure for estimating the trend of the signal in the two-trend model is summarized in Algorithm 2. The corrected trend is now determined by studying the relative position of the historical data with respect to the global trend. The reference position is characterized by the standard deviation σ(y_t − x_t^G), where x_t^G is the filtered global
trend.

[Figure 1.10: cross-validation error e(λ) as a function of ln λ]
with T1 = 400 and T2 = 50. The optimal parameters are λ1 = 2.46 (for the L1 − C filter) and λ2 = 15.94 (for the L2 − T filter). Results are reported in Figure 1.12. The trend for the next 50 trading days is estimated at 7.34% for the L1 − T filter and 7.84% for the HP filter, whereas it is null for the L1 − C and L1 − T C filters. By comparison, the true performance of the S&P 500 index is 1.90% from January 3rd, 2011 to March 15th, 201110.
10. This corresponds exactly to a period of 50 trading days.
Results
In the following simulations, we use the estimators µ̂t and σ̂t in place of µt and σt .
For µ̂t , we consider different models like L1 , HP and moving-average filters11 whereas
we use the following estimator for the volatility:
    σ̂_t² = (1/T) ∫_{t−T}^{t} σ_s² ds ≈ (1/T) Σ_{i=t−T+1}^{t} ln²(S_i / S_{i−1})
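A sketch of this moving-window estimator (function name ours); with a constant log-return of 1% per step, the estimator returns exactly that squared return:

```python
import numpy as np

def realized_variance(prices, T):
    """sigma_hat_t^2 = (1/T) * sum of the last T squared log-returns."""
    r = np.diff(np.log(np.asarray(prices, dtype=float)))
    return np.sum(r[-T:] ** 2) / T

# Price path with a constant 1% log-return per step.
S = 100.0 * np.exp(0.01 * np.arange(300))
var_hat = realized_variance(S, T=250)
```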
We consider a long/short strategy, that is (αmin, αmax) = (−1, 1). In the particular case of the µ̂_t^{L1} estimator, we consider three different models:

11. We note them respectively µ̂_t^{L1}, µ̂_t^{HP} and µ̂_t^{MA}.
3. the combination of both local and global trends corresponds to the third model.

For all these strategies, the test set of the local trend T2 is equal to 6 months (or 130 trading days), whereas the length of the test set for the global trend is four times the length of the test set (T3 = 4T2), meaning that T3 is one year (or 520 trading days). This choice of T3 agrees with the usual choice of window width in the moving-average estimator. The length of the training set T1 is also four times the length of the test set. The study period is from January 1998 to December 2010. In the backtest, the trend estimation is updated every day. In Table 2.3, we summarize the results obtained with the different models cited above. We remark that the best performances correspond to the global-trend, HP and two-trend models. Because the HP filter is calibrated to the window of the moving-average filter, which is equal to T3, it is not surprising that the performances of these three models are similar. Over the considered backtest period, the S&P 500 does not have a clear upward or downward trend. Hence, the local-trend estimator does not give a good prediction and this strategy gives the worst performance. By contrast, the two-trend model takes into account the trade-off between the local trend and the global trend and gives a better result.
For the sake of simplicity, we assume that all the signals are rescaled in the same way12, and we minimize the following objective function:
    (1/2) Σ_{i=1}^{m} ‖y⁽ⁱ⁾ − x‖₂² + λ ‖Dx‖₁
1.6 Conclusion

Momentum strategies are efficient ways to exploit the market tendency in building trading strategies; a good estimator of the trend is therefore essential. In this paper, we show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed; these models can reflect the effect of mean reversion toward the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter.
12. For example, we may center and standardize the time series by subtracting the mean and dividing by the standard deviation.
Bibliography

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.
[2] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the Business Cycle, Journal of Monetary Economics, 7(2), pp. 151-174.
[4] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A Model for the Census X-11 Program, Journal of the American Statistical Association, 71(355), pp. 581-587.
[6] Hastie T., Tibshirani R. and Friedman J. (2009), The Elements of Statistical Learning, Second Edition, Springer.
[7] Harvey A. (1991), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
[8] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[10] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[11] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58(1), pp. 267-288.
Chapter 2
We review in this chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data; then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices are perturbed by an additional noise, the so-called microstructure noise. This effect comes from the bid-ask bounce due to the short time scale: within a short time interval, the trading price does not converge to the price determined by the "supply-demand" equilibrium. In the second part, we discuss the effect of the microstructure noise on volatility estimation; it is a very important topic, concerning the enormous field of "high-frequency" trading. Examples of backtesting on indices and stocks will illustrate the efficiency of the considered techniques.
2.1 Introduction
Measuring the volatility is one of the most important questions in finance. As its name suggests, volatility is a direct measurement of the risk of a given asset. Under the hypothesis that the log-price follows a Brownian motion, volatility is usually estimated by the standard deviation of the daily price movement. As this assumption relates the stock price to the most common object of stochastic calculus, much mathematical work has been carried out on volatility estimation. With the increasing availability of trading data, we can exploit more and more useful information in order to improve the precision of the volatility estimator. A new class of estimators based on the high and low prices was introduced. However, in the real world the asset price is not just a simple geometric Brownian motion: different effects have been observed, including the drift and the opening jump. A general correction
Volatility Estimation for Trading Strategies
scheme based on the combination of various estimators has been studied in order to eliminate these effects.
As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit1, new phenomena due to the non-equilibrium of the market emerge and spoil the precision. This is the so-called microstructure noise, which is characterized by the bid-ask bounce and the transaction effect. Because of this noise, the realized variance estimator overestimates the true volatility of the price process. An approach based on the use of two different time scales can eliminate this effect.
This chapter is organized as follows. In Section II, we review the basic volatility estimator using the variance of realized returns (following the note of B. Bruder), then we introduce the variations based on range estimation. In Section III, we discuss how to measure the instantaneous volatility and the lag effect induced by the moving average. In Section IV, we discuss the effect of the microstructure noise on high-frequency volatility.
• Hti = maxt∈[ti ,ti+1 [ St is the highest price on a given period [ti , ti+1 [
• Lti = mint∈[ti ,ti+1 [ St is the lowest price on a given period [ti , ti+1 [
• uti = ln Hti − ln Oti is the highest price movement during the trading open
• dti = ln Lti − ln Oti is the lowest price movement during the trading open
• cti = ln Cti − ln Oti is the daily price movement over the trading open period
In the following, we assume that the couple (µt, σt) is independent of the Brownian motion Bt driving the asset price evolution.
This quantity can be measured by using the canonical estimator defined as:
    σ̂² = (1/(t_n − t_0)) Σ_{i=1}^{n} R_{t_i}²
The variance of this estimator is approximated by var(σ̂²) ≈ 2σ⁴/n, i.e. its standard deviation is proportional to √2 σ²/√n. It means that the estimation error is small if n is large enough. Indeed, the variance of the volatility estimator reads var(σ̂) ≈ σ²/(2n) and its standard deviation is approximated by σ/√(2n).
A simple example of the general definition is the estimator with annualized returns R_i/√(t_i − t_{i−1}). In this case, our estimator becomes:

    σ̂² = (1/n) Σ_{i=1}^{n} R_{t_i}² / (t_i − t_{i−1})
We remark that if the time step (time increment) is constant, t_i − t_{i−1} = T, then we obtain the same result as with the canonical estimator. However, if the time step t_i − t_{i−1} is not constant, long-period returns are underweighted while short-period returns are overweighted. As we will see in the next discussion on realized volatility, the choice of the weight distribution can help improve the quality of the estimator. For example, we will show that the IGARCH estimation leads to an exponential weight distribution, which is more appropriate for estimating the realized volatility.
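The exponential weighting can be sketched as the standard IGARCH/EWMA recursion (the decay λ = 0.94 below is the conventional RiskMetrics choice, not a value from the text):

```python
import numpy as np

def ewma_variance(returns, lam=0.94):
    """IGARCH-style exponentially weighted variance:
    v_t = lam * v_{t-1} + (1 - lam) * r_t^2, initialized at r_0^2."""
    v = returns[0] ** 2
    for r in returns[1:]:
        v = lam * v + (1.0 - lam) * r ** 2
    return v

# A constant return is a fixed point of the recursion: the estimate equals r^2.
r = np.full(500, 0.01)
v_hat = ewma_variance(r)
```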
We have seen that the daily deviation can be used to define an estimator of the volatility. This comes from the assumption that the logarithm of the price follows a Brownian motion. Since the standard deviation of a diffusive process over a time interval ∆t is proportional to σ√∆t, using the variance to estimate the volatility is quite intuitive. Moreover, if additional information on the price movement within a given time interval is available, such as the highest and lowest values, this range must also provide a good measure of the volatility. This idea was first addressed by W. Feller in 1951. Later, Parkinson (1980) employed the first result of Feller's work to provide the first "high-low" estimator (the so-called Parkinson estimator). If one uses close prices to estimate the volatility, one can eliminate the effect of the drift by subtracting the mean value of the daily variation. By contrast, the use of high and low prices cannot eliminate the drift effect in such a simple way. In addition, the high and low prices can only be observed during the trading session, so this estimator cannot eliminate the second effect, due to the opening jump. Moreover, as demonstrated in the work of Parkinson (1980), this estimator gives a better confidence interval, but it obviously underestimates the volatility because of the discrete observation of the price: the maximum and minimum values over a time interval ∆t are not the true ones of the Brownian motion. They are underestimated, so it is not surprising that the result depends strongly on the frequency of the price quotation. In a high-frequency market this third effect may be negligible; we will nevertheless discuss it later. Because of the limitations of Parkinson's estimator, another estimator, also based on the work of Feller, was proposed by Kunitomo (1992). In order to eliminate the drift, he constructs a Brownian bridge whose deviation is again related to the diffusion coefficient. In the same line of thought, Rogers and Satchell (1991) propose another use of high and low prices in order to obtain a drift-independent volatility estimator. In this section, we review these three techniques, which all remain constrained by the opening jump.
Let us consider the random variable u_{t_i} − d_{t_i} (namely the range of the Brownian motion over the period [t_i, t_{i+1}[). The Parkinson estimator is defined using the following result (Feller, 1951):

    E[(u − d)²] = (4 ln 2) σ² T
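From this identity, the Parkinson estimator divides the mean squared high-low range by 4 ln 2 · T. A sketch (names and simulation parameters ours), checked on simulated driftless intraday paths with σ = 20% a year; as noted in the text, discrete intraday sampling biases the range slightly downward:

```python
import numpy as np

def parkinson(high, low, T=1.0 / 252):
    """Parkinson volatility from Feller's identity E[(u - d)^2] = 4 ln(2) sigma^2 T:
    sigma_hat = sqrt(mean((ln H - ln L)^2) / (4 ln(2) T))."""
    hl = np.log(np.asarray(high)) - np.log(np.asarray(low))
    return np.sqrt(np.mean(hl ** 2) / (4.0 * np.log(2.0) * T))

# Simulated driftless paths: M = 100 intraday steps, open price normalized to 1.
rng = np.random.default_rng(42)
sigma, T, M, days = 0.20, 1.0 / 252, 100, 500
steps = rng.normal(0.0, sigma * np.sqrt(T / M), size=(days, M))
path = np.concatenate([np.zeros((days, 1)), steps.cumsum(axis=1)], axis=1)
high, low = np.exp(path.max(axis=1)), np.exp(path.min(axis=1))
est = parkinson(high, low, T)  # near 0.20, slightly below due to discretization
```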
In order to estimate the error of the estimator, we compute the variance of σ̂_P², which is given by the following expression:

    var(σ̂_P²) = (9ζ(3)/(16 (ln 2)²) − 1) σ⁴/n

The corresponding efficiency is:

    eff(σ̂_P²) = 32 (ln 2)² / (9ζ(3) − 16 (ln 2)²) = 4.91
The Garman-Klass estimator is defined by:

    σ̂_GK² = (1/(nT)) Σ_{i=1}^{n} [0.511 (u_{t_i} − d_{t_i})² − 0.019 (c_{t_i}(u_{t_i} + d_{t_i}) − 2 u_{t_i} d_{t_i}) − 0.383 c_{t_i}²]

The minimal value of the variance corresponding to this quadratic estimator is var(σ̂_GK²) = 0.27 σ⁴/n and its efficiency is now eff(σ̂_GK²) = 7.4.
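A sketch of the estimator on simulated daily open-high-low-close bars (function name and simulation parameters ours):

```python
import numpy as np

def garman_klass(o, h, l, c, T=1.0 / 252):
    """Garman-Klass estimator; u, d, cc are log high, low and close
    relative to the open, as defined in the text."""
    u, d, cc = np.log(h / o), np.log(l / o), np.log(c / o)
    var = np.mean(0.511 * (u - d) ** 2
                  - 0.019 * (cc * (u + d) - 2.0 * u * d)
                  - 0.383 * cc ** 2) / T
    return np.sqrt(var)

# Simulated driftless bars (sigma = 20% a year, open normalized to 1).
rng = np.random.default_rng(7)
sigma, T, M, days = 0.20, 1.0 / 252, 100, 500
path = np.concatenate([np.zeros((days, 1)),
                       rng.normal(0.0, sigma * np.sqrt(T / M), (days, M)).cumsum(axis=1)],
                      axis=1)
o = np.ones(days)
h, l, c = np.exp(path.max(axis=1)), np.exp(path.min(axis=1)), np.exp(path[:, -1])
est_gk = garman_klass(o, h, l, c, T)
```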
Dti = Mti − mti where Mti = maxt∈[ti ,ti+1 [ Wt and mti = mint∈[ti ,ti+1 [ Wt . It has
been demonstrated that the variance of the range of Brownian bridge is directly
proportional to the volatility (Feller 1951):
E D2 = T π 2 σ 2 /6 (2.4)
Higher moment of the Brownian bridge can be also calculated analytically and is
given by the formula 2.10 in Kunitomo (1992).
In particular, the variance of Kunitomo's estimator is equal to var(σ̂_K²) = σ⁴/(5n), which implies an efficiency eff(σ̂_K²) = 10.
E[u(u − c) + d(d − c)] = σ²T

This expectation does not depend on the drift of the Brownian motion, hence it provides a drift-independent estimator, which can be defined as:

σ̂_RS² = (1/(nT)) Σ_{i=1}^{n} [u_{t_i}(u_{t_i} − c_{t_i}) + d_{t_i}(d_{t_i} − c_{t_i})]
The variance of this estimator is given by var(σ̂_RS²) = 0.331σ⁴/n, which gives an efficiency eff(σ̂_RS²) = 6.
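The drift-independence of the Rogers-Satchell estimator can be illustrated numerically. In the sketch below an exaggerated drift is assumed on purpose, to make the Parkinson bias visible at the daily scale; the Rogers-Satchell estimate stays close to the true volatility while the Parkinson estimate is inflated by the drift.

```python
import numpy as np

rng = np.random.default_rng(1)

sigma_true, mu = 0.15, 2.0   # exaggerated drift (assumption) to expose the bias
T = 1.0 / 252
n_days, M = 2000, 2000

# Log-price with drift, restarted at 0 at each interval open
dX = mu * T / M + sigma_true * np.sqrt(T / M) * rng.standard_normal((n_days, M))
X = np.concatenate([np.zeros((n_days, 1)), np.cumsum(dX, axis=1)], axis=1)
u, d, c = X.max(axis=1), X.min(axis=1), X[:, -1]

nT = n_days * T
var_P = np.sum((u - d) ** 2) / (4 * np.log(2) * nT)   # biased upward by the drift
var_RS = np.sum(u * (u - c) + d * (d - c)) / nT       # drift-independent
print(np.sqrt(var_P), np.sqrt(var_RS))   # Parkinson above 0.15, RS close to it
```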
Like the other techniques based on the high-low range, this estimator underestimates the volatility due to the fact that the maximum of a discretized Brownian motion is smaller than the true value. Rogers and Satchell have also proposed a correction scheme which can be generalized to other techniques. Let M be the number of quoted prices and h = T/M the discretization step; the corrected estimator taking into account the finite-step error is given by the root of the following equation:

σ̂_h² = 2bhσ̂_h² + 2a(u − d)√h σ̂_h + σ̂_RS²

where a = √(2π)/4 − (√2 − 1)/6 and b = (1 + 3π/4)/12.
Applying the same idea, Yang and Zhang (2000) have proposed another combination which can also eliminate both effects, as the Kunitomo estimator does. They choose the following combination:

σ̂_YZ² = α σ̂_O²/f + (1 − α) [κ σ̂_C² + (1 − κ) σ̂_HL²]/(1 − f)

In the work of Yang and Zhang, σ̂_RS² is used as the high-low estimator because it is drift-independent.
Figure 2.2: Volatility estimators without drift and opening effects (M = 50)
[Plot of σ (%) along the simulation; legend: simulated σ, CC, OC, P, K, GK, RS, RS−h, YZ.]
Figure 2.3: Volatility estimators without drift and opening effect (M = 500)
[Plot of σ (%); legend: simulated σ, CC, OC, P, K, GK, RS, RS−h, YZ.]
Figure 2.4: Volatility estimators with µ = 30% and without opening effect (M = 500)
[Plot of σ (%).]
Figure 2.5: Volatility estimators with opening effect f = 0.3 and without drift (M = 500)
[Plot of σ (%); legend: simulated σ, CC, OC, P, K, GK, RS, RS−h, YZ.]
Figure 2.6: Volatility estimators with correction of the opening jump (f = 0.3)
We will first estimate the volatility with all the proposed estimators, then verify the quality of these estimators via a backtest using the voltarget strategy². For the simulation of the volatility, we take the same parameters as above with f = 0, µ = 0, N = 5000, M = 500, ξ = 0.01 and σ0 = 0.4. In Figure 2.7, we present the results corresponding to the different estimators. We remark that the group of high-low estimators gives a better result for volatility estimation. We can estimate the errors committed by the various estimators.
[Figure 2.7: estimated σ (%) along the simulated path; legend: simulated σ, CC, OC, P, K, GK, RS, RS−h, YZ.]
The errors obtained for the various estimators are summarized in Table 2.1 below.
We now apply the estimation of the volatility to perform the voltarget strategies. The result of this test is presented in Figure 2.8. In order to control the

² The detailed description of the voltarget strategy is presented in the Backtest section.
quality of the voltarget strategy, we compute the volatility of the voltarget strategy obtained with each estimator. We remark that the calculation of the volatility of the voltarget strategies is performed with the close-to-close estimator with the same averaging window of 3 months (or 65 trading days). The result is reported in Figure 2.9. As shown in the figure, all estimators give more or less the same results. If we compute the error committed by these estimators, we obtain CC = 0.9491, P = 1.0331, K = 0.9491, GK = 1.2344, RS = 1.2703, YZ = 1.1383. This result may come from the fact that we have used the close-to-close estimator to calculate the volatility of all voltarget strategies. Hence, we consider another check based on the realized returns of the strategy,
[Figure 2.8: backtest of the voltarget strategies; legend: benchmark, CC, OC, P, GK, RS, YZ.]
where V_ti is the wealth of the voltarget portfolio. We expect this quantity to follow a Gaussian probability distribution with volatility σ* = 15%. Figure 2.10 shows the probability density function (Pdf) of the realized returns corresponding to all considered estimators. In order to have a more visible result, we compute the difference between the cumulative distribution function (Cdf) of each estimator and
[Figure 2.9: volatility of the voltarget strategies, σ (%); legend: CC, OC, P, K, GK, RS, YZ.]
the expected Cdf (see Figure 2.11). Both results confirm that the Parkinson and the
Kunitomo estimators improve the quality of the volatility estimation.
2.2.6 Backtest
Volatility estimations of S&P 500 index
We now employ the estimators discussed above on the S&P 500 index. Here, we do not have tick-by-tick intraday data, hence Kunitomo's estimator and the Rogers-Satchell correction cannot be applied.

We remark that the effect of the drift is almost negligible, which is confirmed by the Parkinson and Garman-Klass estimators. The opening jump fraction is estimated simply by:
f̂_t = (1 + (σ̂_C/σ̂_O)²)^{−1}
[Figure 2.10: Pdf of the realized returns RV; legend: expected Pdf, CC, OC, P, K, GK, RS, YZ.]
[Figure 2.11: difference ∆Cdf between the Cdf of each estimator and the expected Cdf.]
[Figure 2.12 and Figure 2.13: estimated σ (%) for the S&P 500 index and for BHI UN Equity.]
Figure 2.14: Estimation of the closing interval for S&P 500 index
[Plot of the realized closing ratio f with moving-average, exponential-average, cumulated-average and average estimates, 01/2001–01/2011.]
For the BHI UN Equity data, Figure 2.13 shows that the family of high-low estimators gives a better result than the classical close-to-close estimator. In order to check the quality of these estimators for the prediction of the volatility, we compare the values of the likelihood function corresponding to each estimator. Assuming that the observed signal follows a Gaussian distribution, the likelihood function is defined as:

l(σ) = −(n/2) ln 2π − (1/2) Σ_{i=1}^{n} ln σ_i² − (1/2) Σ_{i=1}^{n} (R_{i+1}/σ_i)²

where R is the future realized return. In Figure 2.17, we present the values of the likelihood function for the different estimators. This function reaches its maximal value for the Rogers-Satchell estimator.
[Figure 2.17: likelihood values (×10⁴) for the CC, OC, P, GK, RS and YZ estimators.]
and compute the number of times the high-low estimator gives a better performance over the ensemble of stocks. The results for the S&P 500 index and its first 100 components are summarized in Table 2.3.
[Backtest plots on the S&P 500 index; legends: S&P 500, CC, OC, P, GK, RS, YZ and benchmark, CC, OC, P, GK, RS, YZ.]
Estimator   σ̂_P²    σ̂_GK²   σ̂_RS²   σ̂_YZ²
6 month     56.2%    52.8%    52.8%    57.3%
3 month     52.8%    49.4%    51.7%    53.9%
2 month     60.7%    60.7%    60.7%    56.2%
1 month     65.2%    64.0%    64.0%    64.0%

Estimator   σ̂_P²    σ̂_GK²   σ̂_RS²   σ̂_YZ²
f̂_c         65.2%    64.0%    64.0%    64.0%
f̂_ma        64.0%    61.8%    61.8%    64.0%
f̂_exp       64.0%    61.8%    60.7%    64.0%
f̂_t         64.0%    61.8%    60.7%    64.0%
In order to overcome this dilemma, we need to have an idea about the dynamics of the variance σ_t² that we would like to measure. Combining this knowledge of the dynamics of σ_t² with the error committed over a long historical window, we can find an optimal window for the volatility estimator. We assume that the variance follows the simplified dynamics used in the last numerical simulation:

dS_t = µ_t S_t dt + σ_t S_t dB_t
dσ_t² = ξ σ_t² dB_t^σ
Here, the time increment is chosen to be constant, t_i − t_{i−1} = T; the variance of this estimator at time t_n is then:

var(σ̂²) ≈ 2σ_{t_n}⁴ T/(t_n − t_0) = 2σ_{t_n}⁴/n

On the other hand, σ_t² is now itself a stochastic process, hence its variance conditional on σ_{t_n}² gives us the error due to the use of historical observations. We rewrite:

(1/(t_n − t_0)) ∫_{t_0}^{t_n} σ_t² dt = σ_{t_n}² − (1/(t_n − t_0)) ∫_{t_0}^{t_n} (t − t_0) σ_t² ξ dB_t^σ

then the error due to the stochastic volatility is given by:

var( (1/(t_n − t_0)) ∫_{t_0}^{t_n} σ_t² dt | σ_{t_n}² ) ≈ ((t_n − t_0)/3) σ_{t_n}⁴ ξ² = nT σ_{t_n}⁴ ξ²/3
The total error of the canonical estimator is simply the sum of these two errors, since the two Brownian motions are assumed to be independent. We define the total estimation error function as follows:

e(σ̂²) = 2σ_{t_n}⁴/n + nT σ_{t_n}⁴ ξ²/3

In order to obtain the optimal window for volatility estimation, we minimize the error function e(σ̂²) with respect to nT, which leads to the following equation:

σ_{t_n}⁴ ξ²/3 − 2σ_{t_n}⁴/(n²T) = 0

This equation has a very simple solution, nT = √(6T)/ξ, and the optimal error is then e(σ̂_opt²) ≈ 2√(2T/3) σ_{t_n}⁴ ξ. The major difficulty of this estimator is the calibration of the parameter ξ, which is not trivial because σ_t² is an unobservable process. Different techniques can be considered, such as the maximum likelihood approach discussed later.
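This optimum is easy to verify numerically. The sketch below uses illustrative values for σ, T and ξ (assumptions, not taken from the text) and compares a grid search over n with the closed-form solution nT = √(6T)/ξ:

```python
import numpy as np

# Illustrative values (assumptions, not from the text)
sigma4 = 0.2 ** 4      # sigma_{t_n}^4
T = 1.0 / 252          # length of one observation interval
xi = 0.5               # vol-of-vol parameter

def total_error(n):
    """Total error e(sigma^2): statistical term + stochastic-volatility term."""
    return 2 * sigma4 / n + n * T * sigma4 * xi ** 2 / 3

n_grid = np.arange(1, 20000)
n_num = n_grid[np.argmin(total_error(n_grid))]   # numerical minimizer
n_opt = np.sqrt(6 / (T * xi ** 2))               # from nT = sqrt(6T)/xi
e_opt = 2 * np.sqrt(2 * T / 3) * sigma4 * xi     # optimal error
print(n_num, n_opt, total_error(n_num), e_opt)   # the two solutions agree
```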
We remark that the contribution of the last term tends to 0 when n tends to infinity. This estimator again has the form of a weighted average, so an approach similar to the one used for the canonical estimator is applicable. Assuming that the volatility follows the lognormal dynamics described by Equation 2.3, the optimal value of β is given by:

β* = (ξ√(8T − ξ²T²) − 4)/(ξ²T − 4) (2.7)
We encounter here the same question as in the canonical case: how to calibrate the parameter ξ of the lognormal dynamics. In practice, we proceed the other way around. We first seek the optimal value β* of the IGARCH estimator, then use the inverse of equation 2.7 to determine the value of ξ:

ξ = √( 4(1 − β*)² / (T(1 + β*²)) )
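The relation 2.7 and its inverse can be cross-checked numerically; in the sketch below (a daily time step is assumed), recovering ξ from β* reproduces the input exactly:

```python
import numpy as np

T = 1.0 / 252   # daily time step (assumption)

def beta_star(xi):
    """Optimal IGARCH parameter beta* as a function of xi (equation 2.7)."""
    return (xi * np.sqrt(8 * T - xi ** 2 * T ** 2) - 4) / (xi ** 2 * T - 4)

def xi_from_beta(beta):
    """Inverse relation: recover xi from an observed optimal beta."""
    return np.sqrt(4 * (1 - beta) ** 2 / (T * (1 + beta ** 2)))

print(beta_star(0.5), xi_from_beta(beta_star(0.5)))  # round trip returns xi = 0.5
```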
with λ = − ln β/T . In this present form, we conclude that the IGARCH estimator is
a weighted-average of the variance σt2 with an exponential weight distribution. The
annualized estimator of the volatility can be written as:
P+∞ R t−iT +T 2
i=1 e−iT λ t−iT σu du
E σ̂t2 σ = P+∞ −iT λ
i=1 T e
e(σ̂²) = (1/eff(σ̂²)) · 2σ_{t_n}⁴/n + (nT/3) σ_{t_n}⁴ ξ²

The minimization of the total error is exactly the same as in the previous example of the canonical estimator, and we obtain the following optimal averaging window:

nT = √( 6T / (eff(σ̂²) ξ²) ) (2.8)
The IGARCH estimator can also be applied to the various types of high-low estimators; the extension consists of performing an exponential moving average instead of the simple average. The parameter β of the exponential moving average will again be determined by the maximum likelihood method, as shown in the discussion below.
• or using the maximum likelihood approach, which consists in maximizing the log-likelihood objective function:

−(n/2) ln 2π − (1/2) Σ_{i=0}^{n} ln(T σ̂_{t_i}²) − Σ_{i=0}^{n} R_{t_i+T}² / (2T σ̂_{t_i}²)
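As a sketch (all simulation parameters are assumptions), this log-likelihood can be used to score competing variance forecasts against realized returns; a forecast that tracks the true variance scores higher than a constant one:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0 / 252, 5000
xi, sig2_0 = 0.5, 0.2 ** 2   # vol-of-vol and initial variance (assumptions)

# Lognormal variance path (d sigma_t^2 = xi sigma_t^2 dB_t) and daily returns
steps = np.arange(1, n + 1)
sig2 = sig2_0 * np.exp(xi * np.sqrt(T) * np.cumsum(rng.standard_normal(n))
                       - 0.5 * xi ** 2 * T * steps)
R = np.sqrt(sig2 * T) * rng.standard_normal(n)

def log_likelihood(sig2_hat):
    """Gaussian log-likelihood of the returns R under variance forecasts sig2_hat."""
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(T * sig2_hat))
            - np.sum(R ** 2 / (2 * T * sig2_hat)))

l_true = log_likelihood(sig2)                 # forecast equal to the true variance
l_const = log_likelihood(np.full(n, sig2_0))  # constant forecast
print(l_true > l_const)   # the better forecast has the larger likelihood
```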
We remark here that the moving-average estimator depends only on the averaging window, whereas the IGARCH estimator depends only on the parameter β. In general, there is no way to compare these two estimators without assuming a specific dynamics. With such a dynamics, the optimal values of both parameters are obtained from the optimal value of ξ, which offers a direct comparison between the quality of these two estimators.
[Plot of the likelihood l(ξ) versus ξ for the optimal close-to-close and IGARCH estimators.]
[Figure 2.21: likelihood l(β) of the high-low estimators versus the filter parameter β.]
Figure 2.22: Likelihood function of high-low estimators versus effective moving window
[Plot of l(n) versus the window n; legend: CC, OC, P, GK, RS, YZ.]
[Figure 2.23: IGARCH estimator versus moving-average estimator for close-to-close prices, σ (%); legend: CC, CC optimal, IGARCH.]
Figure 2.24: Comparison between different IGARCH estimators for high-low prices
[Plot of σ (%).]
Figure 2.25: Daily estimation of the likelihood function for various close-to-close estimators
[Plot of l(σ̂); legend: CC, CC optimal, CC IGARCH.]
Figure 2.26: Daily estimation of the likelihood function for various high-low estimators
[Plot of l(σ̂); legend: CC, OC, P, GK, RS, YZ.]
In Figure 2.27, the result of the backtest of the voltarget strategy is presented for the three considered estimators. The estimators with a dynamical choice of the averaging parameter always give better results than a simple close-to-close estimator with a fixed averaging window n = 25. We next backtest the IGARCH estimator applied to high-low price data; the comparison with the IGARCH estimator applied to close-to-close data is shown in Figure 2.28. We observe that the IGARCH estimator for close-to-close prices is one of the estimators producing the best backtest.
[Figure 2.27: backtest on the S&P 500 index; legend: S&P 500, CC, CC optimal, CC IGARCH.]
Figure 2.28: Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
[Plot of the backtest performance.]
The observed signal Y_t is the cumulated return perturbed by the microstructure noise ε_t:

Y_t = X_t + ε_t

(ii) ε_t ⊥⊥ B_t

From these assumptions, we see immediately that the volatility estimator based on the historical data Y_{t_i} is biased:

var(Y) = var(X) + E[ε²]

The first term var(X) scales as t (the estimation horizon) while E[ε²] is constant, so this estimator can be considered as unbiased only if the time horizon is large enough (t > E[ε²]/σ²). At high frequency, the second term is not negligible and a better estimator must be able to eliminate this term.
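This bias is easy to exhibit by simulation. In the sketch below (the noise scale and all parameters are assumptions), the realized variance over one day is computed at several sampling frequencies; as the number of observations M grows, the noise term 2M E[ε²] dominates the signal σ²T:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, T = 0.2, 1.0 / 252      # true volatility and horizon (assumptions)
noise_std = 1e-4               # microstructure noise scale (assumption)

rv = {}
for M in (100, 1000, 10000):   # number of intraday observations
    dX = sigma * np.sqrt(T / M) * rng.standard_normal(M)
    X = np.concatenate([[0.0], np.cumsum(dX)])
    Y = X + noise_std * rng.standard_normal(M + 1)            # Y = X + eps
    rv[M] = np.sum(np.diff(Y) ** 2)                            # realized variance
    print(M, rv[M], sigma ** 2 * T + 2 * M * noise_std ** 2)   # observed vs predicted
```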
For the discretized version of the quadratic variation, we employ the [·,·] notation:

[X, X]_T = Σ_{t_i,t_{i+1}∈[0,T]} (X_{t_{i+1}} − X_{t_i})²

Then the usual estimator of the realized variance over the interval [0, T] is given by:

[Y, Y]_T = Σ_{t_i,t_{i+1}∈[0,T]} (Y_{t_{i+1}} − Y_{t_i})²

We remark that the number of points in the interval [0, T] can be changed. In fact, the expectation of the quadratic variation should not depend on the distribution of points in this interval. Let us define the ensemble of points in one period as a grid G:

G = {t_0, …, t_M}

Then a subgrid H is defined as:

H = {t_{k_1}, …, t_{k_m}}

Under the hypothesis of the microstructure noise, the conditional expectation of this estimator is equal to:

E[[Y, Y]_T^G | X] = [X, X]_T^G + 2M E[ε²]
In the two expressions above, the sums are arranged order by order. In the limit M → ∞, we obtain the usual central limit theorem result:

M^{−1/2} ([Y, Y]_T^G − 2M E[ε²]) →^L 2(E[ε⁴])^{1/2} N(0, 1)

Hence, as M increases, [Y, Y]_T^G becomes a good estimator of the microstructure noise and we denote:

Ê[ε²] = (1/(2M)) [Y, Y]_T^G

The central limit theorem for this estimator states:

M^{1/2} (Ê[ε²] − E[ε²]) →^L (E[ε⁴])^{1/2} N(0, 1) as M → ∞
As mentioned in the last discussion, increasing the frequency spoils the estimation of the volatility due to the presence of the microstructure noise. The naive solution is to reduce the number of points in the grid, i.e. to consider only a subgrid; one can then average over a number of choices of subgrids. Let us consider a subgrid H with |H| = m − 1; the same result as for the full grid is obtained by replacing M by m:

E[[Y, Y]_T^H | X] = [X, X]_T^H + 2m E[ε²]
Let us now consider a sequence of subgrids H^(k), k = 1, …, K, which satisfies G = ∪_{k=1}^{K} H^(k) and H^(k) ∩ H^(l) = ∅ for k ≠ l. By averaging over these K subgrids, we obtain:

[Y, Y]_T^avg = (1/K) Σ_{k=1}^{K} [Y, Y]_T^{H^(k)}
We define the average length of the subgrids m̄ = (1/K) Σ_{k=1}^{K} m_k; the final expression is then:

E[[Y, Y]_T^avg | X] = [X, X]_T^avg + 2m̄ E[ε²]

This estimator of the volatility is still biased, and its precision depends strongly on the choice of the subgrid length and the number of subgrids. In the paper of Zhang et al., the authors have demonstrated that there exists an optimal value K* for which the best performance of the estimator is reached.
where the index k = 1, …, K and n_k is the integer making t_{k−1+n_k K} the last element of H^(k). As we cannot compute exactly the optimal value K* for each trading period, we employ an iterative scheme which tends to converge to the optimal value. The analytical expression of K* is given by Zhang et al.:

K* = M^{2/3} ( 12 (E[ε²])² / (T E[η²]) )^{1/3}

In a first approximation, we consider the case where the intraday volatility is constant; the expression of η can then be simplified to η² = Tσ⁴. In Figure 2.29, we present the intraday volatility, taking into account only the trading day, for the S&P 500 index under the assumption of constant volatility. The two-time-scale estimator reduces the effect of the microstructure noise on the realized volatility computed over the full grid.
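A minimal sketch of the two-scale construction follows (all sizes and the noise level are assumptions); the bias-corrected combination [Y,Y]^avg − (m̄/M)[Y,Y]^G follows Zhang et al. (2005):

```python
import numpy as np

rng = np.random.default_rng(4)

sigma, T, M = 0.2, 1.0 / 252, 23400   # one observation per second (assumption)
noise_std = 5e-4                       # microstructure noise scale (assumption)
K = 300                                # number of subgrids (illustrative)

dX = sigma * np.sqrt(T / M) * rng.standard_normal(M)
X = np.concatenate([[0.0], np.cumsum(dX)])
Y = X + noise_std * rng.standard_normal(M + 1)

def rv(path):
    """Realized variance (discretized quadratic variation) of a price path."""
    return np.sum(np.diff(path) ** 2)

rv_full = rv(Y)                                        # [Y, Y]^G: noise-dominated
rv_avg = np.mean([rv(Y[k::K]) for k in range(K)])      # [Y, Y]^avg over K subgrids
m_bar = np.mean([len(Y[k::K]) - 1 for k in range(K)])  # average subgrid length
rv_tsrv = rv_avg - (m_bar / M) * rv_full               # two-scale, bias-corrected

annualize = lambda v: np.sqrt(max(v, 0.0) / T)
print(annualize(rv_full), annualize(rv_avg), annualize(rv_tsrv))
```

Only the two-scale estimate recovers a volatility close to the true 20%; the full-grid realized variance is swamped by the 2M E[ε²] term.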
[Figure 2.29: intraday volatility σ (%) of the S&P 500 index, 02/11–06/11; legend: volatility with full grid, volatility with subgrid, volatility with two scales.]
2.5 Conclusion
Voltarget strategies are efficient ways to control risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this chapter, we showed that the price range can be used to improve the forecasting of market volatility. The use of high and low prices matters less for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with a higher volatility level, the high-low estimators improve the prediction of the volatility. We considered several backtests on the S&P 500 index and obtained competitive results with respect to the traditional moving-average estimator of the volatility.
Indeed, we considered a simple stochastic volatility model which permits integrating the dynamics of the volatility into the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain the optimal averaging window dynamically. We also compared these results for range-based estimators with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors.
Finally, we studied high-frequency volatility estimation, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we showed that the microstructure noise can be eliminated by the two-time-scale estimator.
Bibliography
[3] Drost F. C. and Werker J. M. (1999), Closing the GARCH Gap: Continuous Time GARCH Modeling, Journal of Econometrics, 74, pp. 31-57.
[4] Feller W. (1951), The Asymptotic Distribution of the Range of Sums of Independent Random Variables, Annals of Mathematical Statistics, 22, pp. 427-432.
[7] Parkinson M. (1980), The Extreme Value Method for Estimating the Variance of the Rate of Return, Journal of Business, 53, pp. 61-65.
[10] Zhang L., Mykland P. A. and Aït-Sahalia Y. (2005), A Tale of Two Time Scales: Determining Integrated Volatility With Noisy High-Frequency Data, Journal of the American Statistical Association, 100(472), pp. 1394-1411.
Chapter 3
3.1 Introduction
The support vector machine (SVM) is an important part of statistical learning theory. It was first introduced in the early 1990s by Boser et al. (1992) and has found important applications in various domains such as pattern recognition (for example handwritten digit or image recognition) and bioinformatics. This technique can be employed in different contexts such as classification, regression or density estimation, according to Vapnik [1998]. Recently, different applications in the financial field have been developed along two main directions. The first one employs the SVM as a non-linear estimator in order to forecast the market tendency or the volatility. In this context, the SVM is used as a regression technique which extends readily to the non-linear case thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims to elaborate the stock selection of a trading strategy (for example a long/short strategy). In this chapter, we review the support vector machine and its financial applications from both points of view. The literature of this recent field is quite diversified and divergent, with many approaches and different techniques. We first give an overview of the SVM, from its basic construction to its extensions, including the multi-class classification problem. We then present different numerical implementations and bridge them to financial applications.
Support Vector Machine in Finance
H = {x : h(x) = wᵀx + b = 0}

This hyperplane divides the space X into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The term linear classification comes from the fact that this boundary depends on the data in a linear way.
We now define the notion of margin. In Figure 3.1 (reprinted from Ben-Hur A. et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let x⁺ and x⁻ be the closest points to the hyperplane from the positive side and the negative side. The circled data points are the support vectors, i.e. the points closest to the decision boundary (see Figure 3.1). The vector w is the normal vector to the hyperplane; we denote its norm ‖w‖ = √(wᵀw) and its direction ŵ = w/‖w‖. We assume that x⁺ and x⁻ are equidistant from the decision boundary. They determine the margin by which the two classes of points of the dataset D are separated:

m_D(h) = (1/2) ŵᵀ(x⁺ − x⁻)

Geometrically, this margin is just half the distance between the two closest points from the two sides of the hyperplane H, projected in the direction ŵ. We use the equations that define the relative positions of these points with respect to the hyperplane H:

h(x⁺) = wᵀx⁺ + b = a
h(x⁻) = wᵀx⁻ + b = −a

where a > 0 is some constant. As the normal vector w and the bias b are determined only up to a common scale, we can simply divide them by a and renormalize these equations. This is equivalent to setting a = 1 in the above expressions, and we finally get:

m_D(h) = (1/2) ŵᵀ(x⁺ − x⁻) = 1/‖w‖
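This identity can be checked on a toy example (the points and hyperplane below are made-up illustrations, chosen so that h(x⁺) = +1 and h(x⁻) = −1):

```python
import numpy as np

# Toy 2-D closest points and a canonical separating hyperplane (illustrative)
x_pos, x_neg = np.array([1.5, 1.5]), np.array([0.5, 0.5])  # x+ and x-
w, b = np.array([1.0, 1.0]), -2.0                          # h(x+) = +1, h(x-) = -1

w_hat = w / np.linalg.norm(w)
margin_geometric = 0.5 * w_hat @ (x_pos - x_neg)   # (1/2) w_hat^T (x+ - x-)
margin_formula = 1.0 / np.linalg.norm(w)           # 1 / ||w||
print(margin_geometric, margin_formula)            # both equal 1/sqrt(2)
```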
The basic idea of the maximum margin classifier is to determine the hyperplane which maximizes the margin. For a separable dataset, we can define the hard-margin SVM as the following optimization problem:

min_{w,b} (1/2)‖w‖² (3.1)
u.c. y_i(wᵀx_i + b) ≥ 1, i = 1, …, n

Here, y_i(wᵀx_i + b) ≥ 1 is just a compact way to express the relative position of the two classes of data points with respect to the hyperplane H. In fact, we have wᵀx_i + b ≥ 1 for the class y_i = 1 and wᵀx_i + b ≤ −1 for the class y_i = −1.

The historical approach to solving this quadratic program is to map the primal problem to the dual problem. We give here the main result; the detailed derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives the following optimal solution (w*, b*):

w* = Σ_{i=1}^{n} α_i* y_i x_i

where α* = (α_1*, …, α_n*) is the solution of the dual optimization problem with dual variable α = (α_1, …, α_n) of dimension n:

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_iᵀx_j
u.c. Σ_{i=1}^{n} α_i y_i = 0, α_i ≥ 0, i = 1, …, n
Cortes C. and Vapnik V. (1995) under the name "soft-margin SVM". The second one consists of using a non-linear classifier which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which raises a computational issue. An intelligent way to overcome it is to employ the notion of kernel. In the next discussions, we will clarify these two approaches, then finish this section by introducing two general frameworks of this learning theory.

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i (3.3)
u.c. y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, n

Here, C is the parameter used to fix the desired level of error, and p ≥ 1 is the usual way to ensure the convexity of the additional term. The soft-margin solution of the SVM problem can be interpreted as a regularization technique, as found in different optimization problems such as regression, filtering or matrix inversion. The same result will be recovered with the regularization technique later, when we discuss the possible use of kernels.
h(x) = wᵀφ(x) + b

B = {x : wᵀφ(x) + b = 0}

At this stage, the generalization to the non-linear case helps us avoid the problems of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, a quadratic transformation leads to a feature space of dimension N = d(d + 3)/2. The main question is how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our N-dimensional problem turns into the following n-dimensional optimization problem with dual variable α:

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j φ(x_i)ᵀφ(x_j)
u.c. α_i ≥ 0, i = 1, …, n

Indeed, the expansion of the optimal solution w* has the following form:

w* = Σ_{i=1}^{n} α_i* y_i φ(x_i)

In order to solve the quadratic program, we do not need the explicit form of the non-linear mapping but only the kernel K(x_i, x_j) = φ(x_i)ᵀφ(x_j), which is usually assumed to be symmetric. Providing only the kernel K(x_i, x_j) to the optimization problem is enough to construct later the hyperplane H in the feature space F, i.e. the decision boundary in the data space X. The discriminant function can be computed as follows, thanks to the expansion of the optimal w* on the initial data x_i, i = 1, …, n:

h(x) = Σ_{i=1}^{n} α_i y_i K(x, x_i) + b

From this expression, we can construct the decision function used to classify a given input x as f(x) = sign(h(x)).
For a given non-linear function φ(x), we can compute the kernel K(x_i, x_j) via the scalar product of the two vectors in the space F. However, the reciprocal does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we study some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: K(x, y) = (xᵀy + 1)^p
ii. Radial basis kernel: K(x, y) = exp(−‖x − y‖²/2σ²)
iii. Neural network kernel: K(x, y) = tanh(axᵀy − b)
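Mercer's condition can be probed numerically: a valid kernel must produce a symmetric positive semi-definite Gram matrix on any finite set of points. The sketch below uses random points and illustrative parameters (assumptions); only the first two kernels are asserted, since the tanh kernel is known not to satisfy Mercer's condition for all parameter values.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))   # 20 random points in dimension 3

def gram(kernel):
    """Gram matrix K_ij = kernel(x_i, x_j) over the point set X."""
    return np.array([[kernel(x, y) for y in X] for x in X])

K_poly = gram(lambda x, y: (x @ y + 1.0) ** 3)                   # polynomial, p = 3
K_rbf = gram(lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0))   # radial basis, sigma = 1

for K in (K_poly, K_rbf):
    print(np.linalg.eigvalsh(K).min())   # non-negative up to round-off: PSD
```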
As the distribution P(x, y) is unknown, the expected loss cannot be evaluated. However, with the available training dataset {x_i, y_i}, one can compute the empirical risk as follows:

R_emp = (1/n) Σ_{i=1}^{n} l(f(x_i), y_i)

In the limit of a large dataset n → ∞, we expect the convergence R_emp(f) → R(f) for every tested function f, thanks to the law of large numbers. However, is the learning function f which minimizes R_emp(f) also the one minimizing the true risk R(f)? The answer is no. In general, there are infinitely many functions f which learn the training dataset perfectly, f(x_i) = y_i ∀i. In fact, we have to restrict the function space F in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a function space F was first studied in VC theory via the concept of the VC dimension (1971) and the important VC theorem, which gives an upper bound on the convergence probability P{sup_{f∈F} |R(f) − R_emp(f)| > ε} → 0.

A common way to restrict the function space is to impose a regularization condition. We denote by Ω(f) a measure of regularity; the regularized problem then consists of minimizing the regularized risk:

R_reg(f) = R_emp(f) + λΩ(f)

Here λ is the regularization parameter, and Ω(f) can be, for example, an L_p norm of some deviation of f.
then the VC dimension is d + 1.

This theorem gives the explicit relation between the VC dimension and the number of factors, i.e. the number of coordinates in the input vectors of the training set. It can be used in the next theorem in order to evaluate the information necessary for a good classification or regression.

An important corollary of the VC theorem is the upper bound on the convergence of the empirical risk function to the risk function:

Corollary 3.2.4 Under the hypotheses of the VC theorem, the following inequality holds with probability 1 − η:

∀f ∈ F, R(f) − R_emp(f) ≤ √( (h(ln(2n/h) + 1) − ln(η/4)) / n ) + 1/n

We skip the proofs of these theorems and postpone the discussion of the practical importance of the VC theorems to Section 6, as the overfitting and underfitting problems are very present in financial applications.
where δ_{x_i}(x), δ_{y_i}(y) are Dirac distributions located at x_i and y_i respectively. In the VRM framework, instead of dP_emp, the Dirac distribution is replaced by a density estimate in the vicinity of x_i:

dP_vic(x, y) = (1/n) Σ_{i=1}^{n} dP_{x_i}(x) δ_{y_i}(y)
In order to illustrate the difference between the ERM framework and the VRM framework, let us consider the following example of linear regression. In this case, the loss function is l(f(x), y) = (f(x) − y)², where the learning function has the form f(x) = wᵀx + b. Assume that the vicinal probability density dP_{x_i}(x) is approximated by a white noise of variance σ². The vicinal risk is calculated as follows:

R_vic(f) = (1/n) Σ_{i=1}^{n} ∫ (f(x) − y_i)² dP_{x_i}(x)
         = (1/n) Σ_{i=1}^{n} ∫ (f(x_i + ε) − y_i)² dN(0, σ²)
         = (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)² + σ²‖w‖²
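This identity — the vicinal risk of a linear model equals the empirical risk plus a ridge penalty σ²‖w‖² — can be verified by Monte Carlo. The data and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy linear model and dataset (illustrative)
w, b = np.array([1.0, -2.0]), 0.5
X = rng.standard_normal((50, 2))
y = X @ w + b + 0.1 * rng.standard_normal(50)

sigma = 0.3
f = lambda x: x @ w + b

emp = np.mean((f(X) - y) ** 2)                  # empirical risk R_emp
# Monte Carlo vicinal risk: average the loss over Gaussian perturbations of x_i
eps = sigma * rng.standard_normal((20000, *X.shape))
vic = np.mean((f(X + eps) - y[None, :]) ** 2)

closed_form = emp + sigma ** 2 * np.sum(w ** 2)  # R_vic = R_emp + sigma^2 ||w||^2
print(vic, closed_form)                          # the two values agree closely
```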
Classification problem

As introduced in the last section, classification encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result will be very sensitive to the input: a small change in the data can cause an instability in the final result. The second problem arises with non-separable data, in the sense that the function space is too small, so we cannot obtain a solution which minimizes the risk function. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some condition and working with a specific function class (the linear case, for example). In the latter case, one needs to extend the function space by introducing some tolerable error (the soft-margin approach) or working with a non-linear transformation.
a) Linear SVM with soft-margin approach
Cortes C. and Vapnik V. (1995) first introduced the notion of soft margin by accepting that there will be some errors in the classification. They characterize these errors by additional variables ξ_i associated to each data point x_i. These parameters intervene in the classification via the constraints. For a given hyperplane, the constraint y_i(w^T x_i + b) ≥ 1 means that the point x_i is well-classified and lies outside the margin. When we relax this condition to y_i(w^T x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0, i = 1…n, it first allows the point x_i to be well-classified but inside the margin for 0 ≤ ξ_i < 1. For ξ_i > 1, the input x_i may be misclassified. As written above, the primal problem becomes an optimization with respect to both the margin and the total committed error.
min_{w,b,ξ} (1/2) ||w||² + C · F( Σ_{i=1}^n ξ_i^p )
u.c. y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1…n
Here, p is the degree of regularization. We remark that only for p ≥ 1 does the soft-margin problem have a unique solution. The function F(u) is usually chosen as a convex function with F(0) = 0, for example F(u) = u^k. In the following, we consider two specific cases: (i) the hard-margin limit with C = 0; (ii) the L1 penalty with F(u) = u, p = 1. We define the dual vector Λ = (α_1, …, α_n) and the output vector y = (y_1, …, y_n). In order to write the optimization problem in vectorial form, we also define the operator D = (D_ij)_{n×n} with D_ij = y_i y_j x_i^T x_j.
i. Hard-margin limit with C = 0. As shown in Appendix C.1.1, this problem can be mapped to the following dual problem:
max_Λ Λ^T 1 − (1/2) Λ^T D Λ (3.5)
u.c. Λ^T y = 0, Λ ≥ 0
ii. L1 penalty with F(u) = u, p = 1. In this case, the associated dual problem is given by:
max_Λ Λ^T 1 − (1/2) Λ^T D Λ (3.6)
u.c. Λ^T y = 0, 0 ≤ Λ ≤ C1
Remark 2 For the case with L2 penalty (F(u) = u, p = 2), we will demonstrate in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case, with an additional regularization term 1/(2C) added to the matrix D:
max_Λ Λ^T 1 − (1/2) Λ^T (D + (1/(2C)) I) Λ (3.7)
u.c. Λ^T y = 0, Λ ≥ 0
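A minimal numerical illustration of the soft-margin dual (3.6) follows. For simplicity we drop the bias term b, which removes the equality constraint Λ^T y = 0 (a simplifying assumption, not the thesis formulation), and solve the remaining box-constrained QP by projected gradient ascent on toy data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D separable data: class +1 around (2,2), class -1 around (-2,-2)
n = 40
X = np.vstack([rng.normal(2, 0.5, (n // 2, 2)),
               rng.normal(-2, 0.5, (n // 2, 2))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

C = 1.0
D = (y[:, None] * y[None, :]) * (X @ X.T)   # D_ij = y_i y_j x_i^T x_j

# Projected gradient ascent on the (bias-free) soft-margin dual:
#   max  Lam^T 1 - 1/2 Lam^T D Lam,   subject to 0 <= Lam <= C
Lam = np.zeros(n)
eta = 1.0 / np.linalg.norm(D, 2)            # step size from the spectral norm
for _ in range(2000):
    Lam = np.clip(Lam + eta * (1.0 - D @ Lam), 0.0, C)

w = (Lam * y) @ X                           # w = sum_i alpha_i y_i x_i
pred = np.sign(X @ w)
assert np.all(pred == y)                    # training set perfectly separated
```

In practice one would use a dedicated QP solver (the thesis uses a QP program); the projected gradient loop is only meant to make the box constraint 0 ≤ Λ ≤ C1 concrete.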
In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we study the SVM problem only for the two cases with hard and soft margin within the kernel approach. After obtaining the optimal vector Λ? by solving the associated QP program described above, we can compute b via the KKT conditions and then derive the decision function f(x). We recall that w? = Σ_{i=1}^n α_i? y_i φ(x_i).
i. For the hard-margin case, we notice that for α_i? > 0 the inequality constraint becomes an equality. These points are the closest points to the optimal frontier and are called support vectors. Hence, b can be computed easily for a given support vector (x_i, y_i) as follows:
b? = y_i − w?^T φ(x_i)
ii. For the soft-margin case, the KKT condition given in Appendix C.1.2 is slightly different:
α_i? ( y_i(w?^T φ(x_i) + b?) − 1 + ξ_i ) = 0
However, if α_i satisfies the condition 0 < α_i < C, then we can show that ξ_i = 0. The condition 0 < α_i < C defines the subset of training points (support vectors) which are closest to the separation frontier. Hence, b can be computed by exactly the same expression as in the hard-margin case.
From the optimal values of the triple (Λ?, w?, b?), we can construct the decision function which can be used to classify a given input x as follows:
f(x) = sign( Σ_{i=1}^n α_i? y_i K(x, x_i) + b? ) (3.8)
Regression problem
In the last sections, we have discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM or VRM), the SVM problem consists of minimizing the risk function R_emp or R_vic. The risk function can be computed via the loss function l(f(x), y), which defines our objective (classification or regression). Explicitly, the risk function is calculated as:
R(f) = ∫ l(f(x), y) dP(x, y)
where the distribution dP(x, y) can be computed in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as l(f(x), y) = I_{f(x) ≠ y}, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for classification can then be mapped to the maximization of the margin 1/||w||. For the regression problem, the loss function is l(f(x), y) = (f(x) − y)², which means that we count the loss as the regression error.
Remark 3 We have chosen the least-squares error as the loss just for illustration. In general, it can be replaced by any positive function F of f(x) − y. Hence, the loss function in its general form is l(f(x), y) = F(f(x) − y). We remark that the least-squares case corresponds to the L2 norm, so the simplest generalization is to take the loss function as an Lp norm, l(f(x), y) = |f(x) − y|^p. We show later that the special case with L1 brings the regression problem to a form similar to soft-margin classification.
In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case, where the training vector x is already transformed by a non-linear application φ(x). Therefore, the approximating function of the regression reads f(x) = w^T φ(x) + b. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:
R_emp = (1/n) Σ_{i=1}^n (f(x_i) − y_i)²
The risk function in the VRM framework can be interpreted as a regularized form of the ERM risk function. We rewrite the risk function after renormalizing it by the factor 2σ²:
R_vic = (1/2) ||w||² + C Σ_{i=1}^n ξ_i²
with C = 1/(2σ²n). Here, we have introduced new variables ξ = (ξ_i)_{i=1…n} which satisfy y_i = f(x_i) + ξ_i = w^T φ(x_i) + b + ξ_i. The regression problem can now be written as:
min_{w,b,ξ} (1/2) ||w||² + C Σ_{i=1}^n ξ_i²
u.c. y_i = w^T φ(x_i) + b + ξ_i, i = 1…n
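For the linear case with centered data (so that b = 0), this L2-loss problem reduces to ridge regression and can be solved in closed form. The sketch below is our own illustration of the normal equations (X^T X + I/(2C)) w = X^T y implied by the objective; the data and parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y = 3*x1 - x2 + noise, then centered so that b = 0
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=n)
X -= X.mean(axis=0)
y -= y.mean()

# min_w 1/2 ||w||^2 + C * sum_i (y_i - w^T x_i)^2
# Setting the gradient to zero gives the ridge normal equations:
#   (X^T X + I/(2C)) w = X^T y
C = 10.0
w_hat = np.linalg.solve(X.T @ X + np.eye(d) / (2 * C), X.T @ y)

assert np.allclose(w_hat, w_true, atol=0.1)
```

The regularization strength 1/(2C) plays the role of the vicinity variance: a small C means a strong prior toward small ||w||, a large C means trusting the data.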
In the present form, the regression looks very similar to the SVM problem for classification. We notice that the regression problem in the SVM context can easily be generalized in two ways:
• The first way is to introduce a more general loss function F(f(x_i) − y_i) instead of the least-squares loss. This generalization can lead to other types of regression, such as the ε-SV regression proposed by Vapnik (1998).
• The second way is to introduce a weight distribution ω_i for the empirical distribution instead of the uniform distribution:
dP_emp(x, y) = Σ_{i=1}^n ω_i δ_{x_i}(x) δ_{y_i}(y)
ii. ε-SV regression. The ε-SV regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function in the following form:
l(f(x), y) = (|y − f(x)| − ε) I_{|y−f(x)| ≥ ε}
The ε-SV loss function is just a generalization of the L1 error. Here, ε is an additional tolerance parameter which allows us not to count regression errors smaller than ε. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:
R_vic = (1/2) ||w||² + C Σ_{i=1}^n (|f(x_i) − y_i| − ε) I_{|y_i − f(x_i)| ≥ ε}
Because the two sets {y_i − f(x_i) ≥ ε} and {y_i − f(x_i) ≤ −ε} are disjoint, we can break the indicator I_{|y_i − f(x_i)| ≥ ε} into two terms:
I_{|y_i − f(x_i)| ≥ ε} = I_{y_i − f(x_i) − ε ≥ 0} + I_{f(x_i) − y_i − ε ≥ 0}
We introduce the slack variables ξ and ξ′ as in the previous case, satisfying ξ_i ≥ y_i − f(x_i) − ε and ξ_i′ ≥ f(x_i) − y_i − ε. Hence, we obtain the following optimization problem:
min_{w,b,ξ,ξ′} (1/2) ||w||² + C Σ_{i=1}^n (ξ_i + ξ_i′)
u.c. w^T φ(x_i) + b − y_i ≤ ε + ξ_i, ξ_i ≥ 0, i = 1…n
y_i − w^T φ(x_i) − b ≤ ε + ξ_i′, ξ_i′ ≥ 0, i = 1…n
Remark 4 We remark that our approach gives exactly the same result as the traditional approach discussed in Vapnik (1998), in which the objective function is constructed by maximizing the margin with additional terms defining the regression error. These terms are controlled by the pair of slack variables.
The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:
max_{Λ,Λ′} (Λ − Λ′)^T y − ε (Λ + Λ′)^T 1 − (1/2) (Λ − Λ′)^T K (Λ − Λ′) (3.10)
u.c. (Λ − Λ′)^T 1 = 0, 0 ≤ Λ, Λ′ ≤ C1
For the particular case ε = 0, we obtain:
max_Λ Λ^T y − (1/2) Λ^T K Λ
u.c. Λ^T 1 = 0, |Λ| ≤ C1
After the optimization procedure using the QP program, we obtain the optimal vector Λ? and then compute b? by the KKT condition:
w?^T φ(x_i) + b? − y_i = 0
for support vectors (x_i, y_i) (see Appendix C.1.3 for more detail). In order to have a good accuracy for the estimation of b, we average over the set SV of support vectors and obtain:
b? = (1/n_SV) Σ_{i=1}^{n_SV} ( y_i − Σ_{j=1}^n α_j? K(x_i, x_j) )
Here, we have L(y, t) = (y − t)^p for the regression problem, whereas L(y, t) = max(0, 1 − yt)^p for the classification problem. In the case of a quadratic loss or an L2 penalty, the function L(y, t) is differentiable with respect to its second variable, hence one can obtain the zero-gradient equation. In the case where L(y, t) is not differentiable, such as L(y, t) = max(0, 1 − yt), we have to approximate it by a smooth function. Assuming that L(y, t) is differentiable with respect to t, we obtain:
w + C Σ_{i=1}^n (∂L/∂t)(y_i, w^T φ(x_i) + b) φ(x_i) = 0
By introducing the kernel K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j), we rewrite the primal problem as follows:
min_{β,b} (1/2) β^T K β + C Σ_{i=1}^n L(y_i, K_i^T β + b) (3.12)
where K_i is the i-th column of the matrix K. We note that this is now an unconstrained optimization problem which can be solved by gradient descent whenever L(y, t) is differentiable. In Appendix C.2, we present a detailed derivation of the primal implementation for the cases of quadratic loss and soft-margin classification.
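For the quadratic loss L(y, t) = (y − t)², problem (3.12) is itself quadratic in (β, b), so the Newton optimization reduces to a single linear solve. The following sketch is our own illustration (the RBF kernel and all parameter values are assumptions): it solves the stationarity system (K + I/(2C)) β + b1 = y, 1^T K β + n b = 1^T y.

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D regression data and a Gaussian (RBF) kernel
n = 60
x = np.linspace(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)

sigma, C = 1.0, 10.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Quadratic loss makes (3.12) quadratic in (beta, b): one linear solve of
#   (K + I/(2C)) beta + b 1 = y
#   1^T K beta + n b       = 1^T y
A = np.zeros((n + 1, n + 1))
A[:n, :n] = K + np.eye(n) / (2 * C)
A[:n, n] = 1.0
A[n, :n] = K.sum(axis=0)
A[n, n] = n
rhs = np.concatenate([y, [y.sum()]])
sol = np.linalg.solve(A, rhs)
beta, b = sol[:n], sol[n]

fit = K @ beta + b
assert np.mean((fit - y) ** 2) < 0.05   # close fit on training data
```

For non-differentiable losses such as the hinge loss, the same primal scheme requires the smooth approximation discussed above before gradient or Newton steps can be taken.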
In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. When we have a lot of data, we can follow the traditional cross-validation procedure by dividing the total data into two independent sets: the training set and the validation set. The training set {x_i, y_i}_{1≤i≤n} is used for the optimization problem, whereas the validation set {x_i′, y_i′}_{1≤i≤m} is used to evaluate the error via the following test function:
T = (1/m) Σ_{i=1}^m ψ(−y_i′ f(x_i′))
where ψ(x) = I_{x>0}, with I_A the standard notation for the indicator function. When we do not have enough data for the SVM problem, we can employ the training set directly to evaluate the error via the "leave-one-out" error. Let f^0 be the classifier obtained from the full training set and f^p be the one obtained with the point (x_p, y_p) left out. The error is defined by the test of the decision rule f^p on the missing point (x_p, y_p) as follows:
T = (1/n) Σ_{p=1}^n ψ(−y_p f^p(x_p))
We focus here on the first test error function, with an available validation data set. However, the error function involves the step function ψ, which is discontinuous and can cause difficulties if we want to determine the best selection parameters via the optimal test error. In order to search for the minimal test error by gradient descent, for example, we should smooth the test error by regularizing the step function:
ψ̃(x) = 1 / (1 + exp(−Ax + B))
The choice of the parameters A and B is important. If A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
Recently, many important contributions have advanced the field both in accuracy and in complexity (i.e. the reduction of computation time). The extensions have been developed along two main directions. The first one consists of dividing the multi-classification problem into many binary classification problems by using the "one-against-all" or "one-against-one" strategy. The next step is to construct the decision function in the recognition phase. The implementation of the decision for the "one-against-all" strategy is based on the maximum output among all binary SVMs. The outputs are usually mapped into probability estimates, as proposed by different authors such as Platt (1999). For the "one-against-one" strategy, the Max Wins algorithm is adopted in order to take the right decision: the resulting class is the one voted for by the majority of binary classifiers. Both techniques encounter the limitation of complexity and the high cost of computation time. Another improvement in the same direction, the binary decision tree (SVM-BDT), was recently proposed by Madzaro G. et al. (2009); this technique proved able to speed up the computation time. The second direction consists of generalizing the kernel concept in the SVM algorithm into a more general form. This method treats the multi-classification problem directly by writing a general form of the large-margin problem, which is again mapped into the dual problem by incorporating the kernel concept.
The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all and one-against-one strategies. Recently, another technique utilizing a binary decision tree has required less effort in training and is much faster in the recognition phase, with a complexity of order O[log2 N]. All these techniques employ the above SVM implementation directly.
y = argmax_{k∈{1…m}} f_k(x)
In order to avoid the error coming from comparing the outputs of different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):
P̂r(ω_k | f_k(x)) = 1 / (1 + exp(A_k f_k(x) + B_k))
However, nothing guarantees that Σ_{k=1}^m P̂r(ω_k | f_k(x)) = 1, hence we have to renormalize this probability:
P̂r(ω_k | x) = P̂r(ω_k | f_k(x)) / Σ_{j=1}^m P̂r(ω_j | f_j(x))
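A small sketch of this renormalization (the outputs f_k and the calibration parameters A_k, B_k below are hypothetical, chosen only for illustration):

```python
import numpy as np

def platt(f, A, B):
    """Map an SVM output f_k(x) to a probability estimate (Platt, 1999)."""
    return 1.0 / (1.0 + np.exp(A * f + B))

# Outputs of m = 3 one-against-all classifiers for one input x,
# with hypothetical per-class calibration parameters A_k, B_k
f = np.array([1.5, -0.7, -2.0])
A = np.array([-2.0, -2.0, -2.0])   # negative so a larger output -> larger probability
B = np.zeros(3)

p_raw = platt(f, A, B)
assert not np.isclose(p_raw.sum(), 1.0)   # raw estimates need not sum to one

p = p_raw / p_raw.sum()                   # renormalized class probabilities
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == 0                    # the class with the largest output wins
```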
By definition, W_k is the vector defining the boundary which distinguishes the class ω_k from the rest. It is a normal vector to the boundary and points toward the region occupied by class ω_k. Assume that we are able to separate all data correctly by the classifier W. For any point (x, y), when we compute the position of x with respect to the two classes ω_y and ω_k for all k ≠ y, we must find that x belongs to class ω_y. As W_k defines the vector pointing toward class ω_k, when we compare a class ω_y to a class ω_k it is natural to define W_y − W_k as the vector pointing toward class ω_y but not ω_k. As a consequence, W_k − W_y is the vector pointing toward class ω_k but not ω_y. When x is well-classified, we must have (W_y^T − W_k^T) x > 0 (i.e. the class ω_y has
the best score). In order to have a margin as in the binary case, we impose strictly that (W_y^T − W_k^T) x ≥ 1 for all k ≠ y. This condition can be written for all k = 1…m by adding δ_{y,k} (the Kronecker symbol) as follows:
(W_y^T − W_k^T) x + δ_{y,k} ≥ 1
Therefore, solving the multi-classification problem for the training set (x_i, y_i)_{i=1…n} is equivalent to finding W satisfying:
(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1 ∀i, k
We notice here that w = W_i − W_j is a normal vector to the separation boundary H_w = {z | w^T z + b_ij = 0} between the two classes ω_i and ω_j. Hence the width of the margin between two classes is, as in the binary case:
M(H_w) = 1 / ||w||
Maximizing the margin is equivalent to minimizing the norm ||w||. Indeed, we have ||w||² = ||W_i − W_j||² ≤ 2(||W_i||² + ||W_j||²). In order to maximize all the margins at the same time, it turns out that we have to minimize the L2-norm of the matrix W:
||W||₂² = Σ_{i=1}^m ||W_i||² = Σ_{i=1}^m Σ_{j=1}^d W_ij²
The extension to the similar "soft-margin" case can be formulated easily by introducing slack variables ξ_i corresponding to each training point. As before, these slack variables allow a point to be classified inside the margin. The minimization problem now becomes:
min_{W,ξ} (1/2) ||W||² + C · F( Σ_{i=1}^n ξ_i^p )
u.c. (W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1 − ξ_i, ξ_i ≥ 0 ∀i, k
Remark 8 Within the ERM or VRM frameworks, we can construct the risk function via the loss function l(x, y) = I_{F(x) ≠ y} for a data pair (x, y). For example, in the ERM framework, we have:
R_emp(W) = (1/n) Σ_{i=1}^n I_{F(x_i) ≠ y_i}
The classification problem is now equivalent to finding the optimal matrix W? which minimizes the empirical risk function. In the binary case, we have seen that the optimization of the risk function is equivalent to minimizing the norm ||w||² under linear constraints. We remark that in the VRM framework, this problem can be tackled exactly as in the binary case. In order to prove the equivalence of minimizing the risk function with the large-margin principle, we look for a linear upper bound of the indicator function I_{F(x) ≠ y}. As shown in Crammer K. et al. (2001), we consider the following function:
g(x, y; k) = (W_k^T − W_y^T) x + 1 − δ_{y,k}
If the data are separable, then the optimal value of the risk function is zero. If one requires that the upper bound of the risk function be zero, then the W? which optimizes this bound must be the one which optimizes R_emp(W). The minimization can be expressed as:
max_k ( (W_k^T − W_{y_i}^T) x_i + 1 − δ_{y_i,k} ) = 0 ∀i
or, in the same form as the large-margin problem:
(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1 ∀i, k
Following the traditional routine for solving this problem, we map it into the dual problem as in the binary classification case. The details of this mapping are given in K. Crammer and Y. Singer (2001). We summarize here their main result in the dual form, with dual variables η_i of dimension m for i = 1…n. Defining τ_i = 1_{y_i} − η_i, where 1_{y_i} is the zero column vector except for the y_i-th element, then in the case of soft margin with p = 1 and F(u) = u we have the dual problem:
max_τ Q(τ) = −(1/2) Σ_{i,j} (x_i^T x_j) τ_i^T τ_j + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i}
We remark again that we obtain a quadratic program which involves only the inner products between all pairs of vectors x_i, x_j. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor x_i^T x_j by the kernel K(x_i, x_j):
max_τ Q(τ) = −(1/2) Σ_{i,j} K(x_i, x_j) τ_i^T τ_j + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i} (3.13)
The optimal solution of this problem allows us to evaluate the classification rule:
H(x) = argmax_{r=1…m} { Σ_{i=1}^n τ_{i,r} K(x, x_i) } (3.15)
For a small number of classes m, we can implement the above optimization via the traditional QP program with a matrix of size mn × mn. However, for a large number of classes, we must employ an efficient algorithm, as even storing an mn × mn matrix is already a complicated problem. Crammer and Singer have introduced an interesting algorithm which improves this optimization both in storage and in computation speed.
two models of time series. The first model is simply a deterministic trend perturbed by a white noise:
y_t = (t − a)³ + σ N(0, 1) (3.16)
The second model for our tests is the Black-Scholes model of the stock price:
dS_t / S_t = µ dt + σ dB_t (3.17)
We notice here that the studied signal is y_t = ln S_t. The parameters of the model are the annualized return µ = 5% and the annualized volatility σ = 20%. We consider the regression over a period of one year, corresponding to N = 260 trading days.
The first test compares the L1-regressor and the L2-regressor with a Gaussian kernel (see Figures 3.3-3.4). As shown in Figures 3.3 and 3.4, the L2-regressor seems better suited for the regression. Indeed, over many tests on data simulated from model (3.17), we observe that the L2-regressor is more stable than the L1-regressor (i.e. L1 is more sensitive to the training data set). In the second test, we compare different L2 regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid.
Figure 3.3: L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.16)
Figure 3.4: L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.17)
[Figure: comparison of L2 regressions with Linear, Polynomial, Gaussian and Sigmoid kernels for model (3.16)]
[Figure: comparison of L2 regressions with Linear, Polynomial, Gaussian and Sigmoid kernels for model (3.17)]
discussion. We now apply this technique to estimate the derivative of the trend µ̄_t, then plug it into a trend-following strategy, with m the risk tolerance and σ̂_t the estimator of volatility given by:
σ̂_t² = (1/T) ∫_0^T σ_t² dt = (1/T) Σ_{i=t−T+1}^t ln²(S_i / S_{i−1})
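This realized-volatility estimator is easy to check on simulated Black-Scholes prices. In the sketch below (our own illustration; the annualization by 1/dt is our convention, and all names are assumptions), the estimator recovers the true σ = 20%:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated Black-Scholes prices: annualized mu = 5%, sigma = 20%
mu, sigma, dt = 0.05, 0.20, 1.0 / 260
n = 5 * 260                                   # five years of daily prices
log_ret = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
S = 100.0 * np.exp(np.cumsum(log_ret))

def realized_vol(S, T):
    """sigma_hat_t^2 = (1/T) * sum of squared log-returns over the window T,
    annualized here by dividing by dt."""
    r = np.diff(np.log(S[-T - 1:]))           # T daily log-returns
    return np.sqrt(np.mean(r**2) / dt)

vol = realized_vol(S, T=4 * 260)
assert abs(vol - sigma) < 0.02                # close to the true 20%
```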
SVM-Filtering
We discuss now how to build a cross-validation procedure which can help to learn the trend of a given signal. We employ the moving average as a benchmark to compare with this new filter. An important parameter in moving-average filtering is the estimation horizon T, so we use this horizon as a reference to calibrate our SVM-filtering. For the sake of simplicity, we study here only the SVM-filter with a Gaussian kernel and L2 penalty. The two typical parameters of the SVM-filter are C and σ: C allows a certain level of error in the regression curve, while σ characterizes the estimation horizon and is directly proportional to T. We propose two schemes for the validation procedure, based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter σ = T and optimize the error-tolerance parameter C on the validation set. This scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing both parameters C and σ on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret, as σ is now a dynamic parameter. However, by relating σ to the local horizon, we can gain additional insight into the changes in the price of the underlying asset. For example, we can determine from the historical data whether the underlying asset undergoes a period with a long or a short trend. It can help to recognize additional signatures such as cycles between long and short trends. We report the two schemes in the following algorithm.
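The second calibration scheme can be sketched as a simple grid search of (C, σ) on the validation set. In the illustration below we use an L2-penalty kernel regressor as a stand-in for the SVM-filter (a simplifying assumption), and a random train/validation split; all data and parameter grids are our own choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(u, v, sigma):
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2 * sigma**2))

def fit_predict(x_tr, y_tr, x_va, C, sigma):
    """L2-penalty kernel regressor used as a stand-in for the SVM-filter."""
    K = rbf(x_tr, x_tr, sigma)
    beta = np.linalg.solve(K + np.eye(len(x_tr)) / (2 * C), y_tr)
    return rbf(x_va, x_tr, sigma) @ beta

# Noisy trend to be filtered, split at random into training and validation sets
t = np.arange(120, dtype=float)
y = 0.02 * t + np.sin(t / 10.0) + 0.3 * rng.normal(size=len(t))
idx = rng.permutation(len(t))
tr, va = idx[:90], idx[90:]

# Second calibration scheme: optimize both C and sigma on the validation set
candidates = [(np.mean((fit_predict(t[tr], y[tr], t[va], C, s) - y[va]) ** 2), C, s)
              for C in (0.1, 1.0, 10.0) for s in (5.0, 10.0, 20.0)]
err, C_star, sigma_star = min(candidates)
assert err < np.var(y[va])   # better than predicting a constant
```

In the first scheme, σ would instead be pinned to the reference horizon T and only C searched over.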
Backtesting
We first check the SVM-filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return µ = 10% and annualized volatility σ = 20%. The regression is based on one year of trading data (n = 260 days) with a fixed horizon of one month (T = 20 days). In Figure 3.8, we present the result of the SVM trend prediction with fixed horizon T = 20, whereas Figure 3.9 presents the SVM trend prediction for the second scheme.
period of n dates. The performance of the index or of an individual stock that we are interested in is given by y. We look for the prediction of the value y_{n+1} by regressing on the historical data (X_t, y_t)_{t=1…n}. In this case, the different stocks play the role of factors in the training vectors. We can also apply other regressions, such as the prediction of the performance of a stock based on the available information on all the factors.
Multivariate regression
We first test the efficiency of the multivariate regression on a simulated model. Assume that each factor follows a Brownian motion:
dX_t^{(i)} = µ^{(i)} dt + σ^{(i)} dB_t^{(i)} ∀i = 1…d
Let (y_t)_{t=1…n} be the vector to be regressed, related to the input X by the function:
y_t = f(X_t) = W_t^T X_t
We would like to regress the vector y = (y_t)_{t=2…n} on the historical data (X_t)_{t=1…n−1} by SVM regression. This regression is given by the function y_t = F(X_{t−1}). Hence, the prediction of the future performance is given by y_{n+1} = F(X_n).
In Figure 3.10, we present the results obtained with the Gaussian kernel under L1 and L2 penalty conditions, whereas in Figure 3.11 we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we notice this lag in the prediction of the value of y.
Figure 3.10: L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.16)
[Figure 3.11: multivariate regression with Linear, Polynomial, Gaussian and Sigmoid kernels]
Backtesting
Binary-SVM classifier
Let us compare here the two proposed approaches (dual/primal) for solving the SVM classification problem numerically. To carry out the test, we consider a random training data set of n vectors x_i with classification criterion y_i = sign(x_i). We present here the comparison of the two classification approaches with a linear kernel. The result of the primal approach is obtained directly with the software of O. Chapelle (see footnote 2). This software was implemented with the L2 penalty condition. Our dual solver is implemented for both L1 and L2 penalty conditions by simply employing the QP program. In Figure 3.12, we show the classification results obtained by both methods with the L2 penalty condition.
We next test the non-linear classification by using the Gaussian kernel (RBF kernel) for the binary dual solver. We generate the simulated data in the same way as in the last example, with x ∈ R². The result of the classification is illustrated in Figure 3.13 for the RBF kernel with parameters C = 0.5 and σ = 2 (see footnote 3).
Multi-SVM classifier
We first test the implementation of SVM-BDT on simulated data (x_i)_{i=1…n} generated randomly. We suppose that these data are distributed in N_c classes. In order to test our multi-SVM implementation efficiently, the response vector y =
2. The free software of O. Chapelle can be found at http://olivier.chapelle.cc/primal/
3. We used here the "plotlssvm" function of the LS-SVM toolbox for graphical illustration. A similar result was also obtained using the "trainlssvm" function of the same toolbox.
(y1 . . . yn ) is supposed to be dependent only on the first coordinate of the data vector:
z = U (0, 1)
x1 = Nc z
y = [x1 ] + N (0, 1)
xi = U (0, 1) ∀i > 1
Here [a] denote the part of a. We can generate our simulated data in much more
general way but it will be very hard to visualize the result of the classification.
Within the above choice of simulated data, we can see that in the case = 0 the
data a separable in the axis x1 . In the geometric view, the space Rd is divided in to
Nc zones along the axis x1 : Rd−1 × [0, 1[, . . . , Rd−1 × [Nc , Nc + 1[. The boundaries
are simply the Nc hyperplane Rd−1 crossing x1 = 1 . . . Nc . When we introduce some
noise on the coordinate x1 ( > 0), then the training set is now is not separable
by these ensemble of linear hyperplanes. There will be some misclassified points
and some deformation of the boundaries thank to non-linear kernel. For the sake of
simplicity, we assume that the data (x, y) are already gathered by group. In Figures
?? and 3.15, we present the classification results for in-sample data and out-of-simple
data in the case = 0 (i.e. separable data). We are now introduce the noise in the
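Our reading of this data-generating scheme can be sketched as follows (the exact placement of the noise ε is our assumption, since the construction above is only partially recoverable from the text):

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate(n, d, Nc, eps):
    """Simulated classification data: the class label depends only on the
    first coordinate x1; eps > 0 makes the classes non-separable."""
    X = rng.uniform(size=(n, d))
    X[:, 0] = Nc * rng.uniform(size=n)                     # x1 = Nc * U(0,1)
    y = np.floor(X[:, 0] + eps * rng.normal(size=n)).astype(int)
    return X, np.clip(y, 0, Nc - 1)

# With eps = 0 the data are separable along x1 by the hyperplanes x1 = 1..Nc-1
X, y = simulate(500, 3, Nc=10, eps=0.0)
assert np.all(y == np.floor(X[:, 0]).astype(int))

# With eps > 0 some points fall on the wrong side of those hyperplanes
X, y = simulate(500, 3, Nc=10, eps=0.3)
assert np.any(y != np.floor(X[:, 0]).astype(int))
```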
[Figures: multiclass SVM classification versus the real distribution — simulated classes C1–C10 along the coordinate x_1 (in-sample and out-of-sample data), and the real sector distribution (Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities, Consumer Services) versus the multiclass SVM prediction]
Calibration procedure
As discussed above in the implementation part of the SVM solver, there are two kinds of parameters which play an important role in the classification process. The first parameter, C, concerns the error tolerance of the margin, and the second concerns the choice of kernel (σ for the Gaussian kernel, for example). In the last example, we optimized the pair of parameters (C, σ) in order to obtain the best classifiers which do not commit any error on the training set. However, this result is valid only if the sectors are correctly defined; nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determination of the binary SVM classifiers on the training data set, and (ii) calibration of the parameters on the validation set. In fact, we decide to optimize the pair of parameters (C, σ) by minimizing the realized error on the validation set, because the committed error on the training set (learning set) must always be smaller than the one on the validation set (unknown set). In the second phase, we can redefine the sectors in the sense that if any asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain first a new recognition of the sectors and second a multi-classifier for new assets.
As the SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set. We propose here to optimize the
SVM parameters by minimizing the error on the validation set. We use the same error function defined in Section 3, but applied on the validation data set V:
Error = (1/card(V)) Σ_{i∈V} ψ(−y_i′ f(x_i′))
where ψ(x) = I_{x>0}, with I_A the standard notation for the indicator function. As before, the discontinuous step function ψ can cause difficulties when searching for the optimal selection parameters; we therefore smooth the test error with the sigmoid ψ̃(x) = 1/(1 + exp(−Ax + B)), with the same trade-off in the choice of the parameters A and B as discussed in Section 3.
Recognition of sectors
By construction, the SVM classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able to recognize the sectors themselves or to introduce a new, corrected definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is more related to the origin of the stock than to its intrinsic behavior in the market. A misclassified stock can create problems for a trading strategy, for example in a pair-trading strategy. Here, we try to overcome this weak point of SVM and introduce a method which modifies the initial definition of the sectors.
The main idea of the sector-recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start with the initial definition of the given sectors. Within each iteration, we learn on the training set in order to determine the classifiers, then we evaluate the validation error. An optimization procedure on the validation error helps us determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may still commit some errors on the training set. If the validation error is smaller than a certain threshold and there is no error on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both the training and validation sets. The iteration procedure is repeated until no error on the training set is committed for a given expected threshold of error on the validation set. The algorithm of this sector-recognition procedure is summarized in the following table:
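As a minimal sketch of this relabeling loop (the function names are illustrative, and a nearest-centroid classifier stands in for the tuned SVM of the text):

```python
import numpy as np

def centroid_fit(X, y):
    """Toy stand-in for the tuned SVM: classify by nearest class centroid."""
    labels = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in labels])
    return lambda Z: labels[np.argmin(
        ((Z[:, None, :] - centroids[None]) ** 2).sum(axis=-1), axis=1)]

def recognize_sectors(X_train, y_train, X_valid, y_valid, fit, tol=0.1, max_iter=20):
    """Iterate: train a classifier, relabel misclassified training stocks
    with the predicted sector, and repeat until the training error vanishes
    while the validation error stays below `tol`."""
    y = y_train.copy()
    for _ in range(max_iter):
        predict = fit(X_train, y)
        if np.mean(predict(X_valid) != y_valid) > tol:
            break                          # validation too poor: stop early
        pred = predict(X_train)
        wrong = pred != y
        if not wrong.any():
            return y                       # optimal sector definition reached
        y[wrong] = pred[wrong]             # relabel misclassified stocks
    return y
```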
must satisfy some classification criterion, for example the performance. We denote by (x_i)_{i=1…n} the factor ensembles, with x_i the ensemble of factors of the i-th stock. The classification criterion, such as the performance, is denoted by the vector y = (y_i)_{i=1…n}. The aim of the SVM classifier in this problem is to recognize which stocks (scores) belong to the high or low performance class (outperforming or underperforming). More precisely, we have to identify a separation boundary as a function of score and performance, f(x, y). Hence, SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all characterizations of a given stock, such as the price, the risk, macro-properties, etc., into comparable quantities); (ii) application of the SVM classification algorithm with an adaptive choice of parameters. In the following, we first give a brief description of the score construction and then establish the backtest of the stock-picking strategy.
Y⋆ = Xᵀβ + α + ε
S = Pr(Y = 1 | X)
Using the parameters estimated by maximum likelihood, we can predict the score of a given asset with factor vector X as follows:

Ŝ = Φ(Xᵀβ̂ + α̂)
The probability distribution of the score Ŝ can be computed by the empirical formula:

Pr(Ŝ < s) = (1/n) Σ_{i=1}^{n} I{Ŝ_i < s}
Y₀ = Xᵀβ₀ + α₀ + N(0, 1)
Y = I{Y₀ > 0}
Here, the parameters of the model, α₀ and β₀, are chosen as α₀ = 0.1 and β₀ = 1. We employ the Probit regression in order to determine the score of n = 500 data points in the cases d = 2 and d = 5. The comparisons between the Probit score and the simulated score are presented in Figures 3.20-3.22.
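The simulation and Probit fit above can be sketched as follows (a maximum-likelihood fit written out with scipy; the variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 2
alpha0, beta0 = 0.1, np.ones(d)

# Simulate the latent model Y0 = X^T beta0 + alpha0 + N(0, 1), Y = 1{Y0 > 0}
X = rng.standard_normal((n, d))
y = (X @ beta0 + alpha0 + rng.standard_normal(n) > 0).astype(float)

def neg_log_likelihood(theta):
    alpha, beta = theta[0], theta[1:]
    p = np.clip(norm.cdf(alpha + X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_likelihood, np.zeros(d + 1), method="BFGS")
alpha_hat, beta_hat = res.x[0], res.x[1:]
score = norm.cdf(alpha_hat + X @ beta_hat)   # predicted Probit scores in [0, 1]
```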
Figure 3.20: Comparison between simulated score and Probit score for d = 2
Figure 3.21: Comparison between simulated score CDF and Probit score CDF for d = 2
Figure 3.22: Comparison between simulated score PDF and Probit score PDF for d = 2
of the SVM algorithm, based on SVM classification, for building the scores which later allow us to implement long/short strategies by using the selection curves. Our main idea of the SVM score construction is very similar to the Probit model. We first define a binary variable Y_i = ±1 associated with each asset x_i. This variable characterizes the performance of the asset with respect to the benchmark: if Y_i = −1 the stock underperforms, whereas if Y_i = 1 the stock outperforms. We next employ the binary SVM classification to separate the universe of stocks into two classes: high performance and low performance. Finally, we define the score of each stock as its distance to the decision boundary.
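A toy sketch of this score construction, with a linear soft-margin SVM trained by hinge-loss subgradient descent standing in for the kernel SVM of the text (all names and hyperparameters are illustrative):

```python
import numpy as np

def svm_score(X, y, lam=0.01, lr=0.1, epochs=200):
    """Train a toy linear soft-margin SVM (subgradient descent on the
    regularized hinge loss, labels y = +/-1) and return each stock's
    score: its signed distance to the decision boundary, rescaled to
    [0, 1]."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                              # margin violators
        if mask.any():
            w -= lr * (lam * w - (y[mask, None] * X[mask]).mean(axis=0))
            b -= lr * (-y[mask].mean())
        else:
            w -= lr * lam * w
    raw = (X @ w + b) / np.linalg.norm(w)               # signed distance to boundary
    return (raw - raw.min()) / (raw.max() - raw.min())
```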
Selection curve
selection curve for which the score plays the role of the parameter:
Q(s) = Pr(S ≥ s)
E(s) = Pr(S ≥ s | Y = 0)
∀ s ∈ [0, 1]
This parametric curve can be traced in the square [0, 1] × [0, 1], as shown in Figure 3.23. On the x-axis, Q(s) defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, E(s) defines the error committed by the stock selection: for a given quantile, it measures the chance of picking a badly performing stock. Two trivial limits are the points (0, 0) and (1, 1): the first corresponds to the limit with no selection, whereas the second corresponds to the limit where everything is selected. A good score construction method should produce a selection curve that is as convex as possible, because this guarantees a selection with fewer errors.
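Empirically, the selection curve is computed by sweeping the score threshold (a sketch; names are illustrative, with Y coded as 0/1):

```python
import numpy as np

def selection_curve(scores, labels):
    """Long-strategy selection curve: for each threshold s,
    Q(s) = Pr(S >= s) on the x-axis and E(s) = Pr(S >= s | Y = 0),
    the chance of picking a low-performance stock, on the y-axis."""
    thresholds = np.sort(np.unique(scores))
    q = np.array([np.mean(scores >= s) for s in thresholds])
    e = np.array([np.mean(scores[labels == 0] >= s) for s in thresholds])
    return q, e
```

Plotting e against q traces the curve from (1, 1) (everything selected) toward (0, 0) (nothing selected); the more it bends below the diagonal, the better the score.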
Figure 3.23: Selection curve for long strategy for simulated data and Probit model
Conversely, for a short strategy, the selection curve can be obtained by tracing the following parametric curve:

Q(s) = Pr(S ≤ s)
E(s) = Pr(S ≤ s | Y = 1)
∀ s ∈ [0, 1]

Here, Q(s) helps us determine the quantile of low-performance stocks to be shorted, while E(s) helps us avoid selling the high-performance ones. As the selection
[Figure: selection curve for the short strategy]
As presented in the last discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained from the training set, whereas its optimal parameters (C, σ) are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set incorporates more information from the recent past.
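The chronological split can be sketched as follows (the split fractions are illustrative):

```python
import numpy as np

def time_ordered_split(data, f_valid=0.3, f_train=0.5):
    """Split time-ordered data into (validation, training, testing) sets,
    in that chronological order, so the training set is adjacent to both
    the validation period and the testing period."""
    n = len(data)
    i1 = int(n * f_valid)
    i2 = int(n * (f_valid + f_train))
    return data[:i1], data[i1:i2], data[i2:]
```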
We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 3.25 presents the construction of the score based on the training set and the validation set. The SVM parameters are optimized on the validation set, while the final score construction uses both the training and validation sets in order to have the largest data ensemble.
[Figure 3.25: selection curves for the SVM training, validation and testing sets]
3.7 Conclusion
The support vector machine is a well-established method, widely used in various domains. From a financial point of view, it can be used to recognize and to predict high-performance stocks. Hence, the SVM is a good indicator for building efficient trading strategies over a universe of stocks. In this chapter, we first revisited the basic ideas of SVM in both the classification and regression contexts.
Bibliography
[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression,
Neural Information Processing, 11, pp. 203-224.
[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.
[7] Chapelle O. et al., (2002), Choosing Multiple Parameters for Support Vector
Machine, Machine Learning, 46, pp. 131-159.
[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Journal
Neural Computation, 19, pp. 1155-1178.
[11] Gestel T.V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.
[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.
[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.
[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.
Chapter 4

Analysis of Trading Impact in the CTA strategy
We review in this chapter trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We first study the momentum strategy in the single-asset case, then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be demonstrated that the cumulated return of the strategy can be broken down into two important parts: the option profile, which is similar in concept to the straddle profile suggested by Fung and Hsieh (2001), and the trading impact, which directly involves the effect of the estimation error on the efficiency of the strategy. We focus in this chapter on the second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund.
4.1 Introduction
Trend-following strategies are a specific example of an investment style that has recently emerged as an industry. The funds implementing them are the so-called Commodity Trading Advisors (CTAs), which play an important role in the hedge fund industry (15% of total hedge fund AUM). Recently, this investment style was carefully reviewed and analyzed in the 7th White Paper of the Lyxor edition. We present here a complementary result to that paper and give a more specific analysis of a typical CTA. We focus here on the trading impact by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This
study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund.
This chapter is organized as follows. In the first part, we recall the main results on the trend-following strategy in the univariate case, which were demonstrated in the 7th White Paper of Lyxor. We next generalize these results to the multivariate case, which establishes a framework for studying the impact of the correlation and of the number of assets in a CTA fund. Finally, we finish with the study of a toy model which helps us understand the efficiency of the trend-following strategy.
4.2 Conclusion
Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this chapter, we study the impact of the estimation error on a trend-following strategy, both in the single-asset and multi-asset cases. The objective is twofold. First, we have established a general framework for analyzing a CTA fund. Second, we have illustrated important results on the trading impact in a CTA strategy via a simple "toy model". We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) of a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. This implies that above this limit, adding more assets does not improve the performance much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well. As usual, the higher the correlation level, the less efficient the strategies. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gain is much higher than the conditional expectation of loss.
Bibliography
[5] Khatri C.G. (1978), A Remark on the Necessary and Sufficient Conditions for a Quadratic Form to be Distributed as a Chi-square, Biometrika, 65, pp. 239-240.
[6] Kotz S., Johnson N.L. and Boyd D.W. (1967), Series Representations of Distributions of Quadratic Forms in Normal Variables II. Non-Central Case, The Annals of Mathematical Statistics, 38, pp. 838-848.
[7] Murison R. (2005), Distribution Theory and Inference, School of Science and Technology, ch. 6, pp. 86-88.
[9] Ruben H. (1962), A New Result on the Distribution of Quadratic Forms, The Annals of Mathematical Statistics, 34, pp. 1582-1584.
[10] Shah B.K. (1963), Distribution of Definite and of Indefinite Quadratic Forms from a Non-Central Normal Distribution, The Annals of Mathematical Statistics, 34, pp. 186-190.
[11] Shah B.K. and Khatri C.G. (1961), Distribution of a Definite Quadratic Form for Non-Central Normal Variates, The Annals of Mathematical Statistics, 32, pp. 883-887.
Conclusions
During my internship in the R&D team of Lyxor Asset Management, I had the chance to work on many interesting topics concerning quantitative asset management. Beyond this report, the results obtained during the stay have been used for the 8th edition of the Lyxor White Paper series. The main results of this internship can be divided into three main lines. The first consists of improving the trend and volatility estimations, which are important quantities for implementing dynamical strategies. The second concerns the application of machine learning techniques in finance: we employ the support vector machine to forecast the expected return of financial assets and to obtain a criterion for stock selection. The third is devoted to the analysis of the performance of the trend-following strategy (CTA) in the general case. It consists of studying the efficiency of a CTA with respect to changes in the market, such as the correlation between the assets or their performance.
In the first part, we focused on improving the trend and volatility estimations in order to implement two crucial momentum strategies: trend-following and vol-target. We show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed. We remark that these models can reflect the effect of mean-reversion toward the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter. On the other hand, vol-target strategies are efficient ways to control the risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this report, we present improvements in volatility forecasting obtained with some novel techniques. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with higher volatility levels, the high-low estimators improve the prediction of volatility. We consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average estimator of volatility. Indeed, we consider a simple stochastic volatility model which permits us to integrate the dynamics of the volatility into the estimator. An optimization scheme via a maximum-likelihood algorithm allows us to obtain dynamically the optimal averaging window. We also compare these
results for the range-based estimator with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors. Finally, we studied the high-frequency volatility estimator, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two-time-scale estimator.
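For illustration, the two-time-scale estimator mentioned above can be sketched as follows, in the spirit of Zhang et al. (2005) (the subsampling factor K and all names are illustrative):

```python
import numpy as np

def two_scale_rv(prices, K=5):
    """Two-time-scale realized variance: average the realized variances of
    K subsampled (slow-scale) grids and subtract a bias correction built
    from the fast-scale realized variance, which is contaminated by
    microstructure noise."""
    logp = np.log(np.asarray(prices, dtype=float))
    n = len(logp) - 1
    rv_fast = np.sum(np.diff(logp) ** 2)                  # all ticks, noisy
    rv_slow = np.mean([np.sum(np.diff(logp[k::K]) ** 2) for k in range(K)])
    n_bar = (n - K + 1) / K                               # average slow-scale sample size
    return rv_slow - (n_bar / n) * rv_fast                # noise-bias corrected
```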
Appendix A
Appendix of chapter 1
This problem can be solved by considering the dual problem, which is a QP program. We first rewrite the primal problem with the new variable z = Dx:

min ½‖y − x‖²₂ + λ‖z‖₁
u.c. z = Dx
We now construct the Lagrangian function with the dual variable ν ∈ ℝⁿ⁻²:

L(x, z, ν) = ½‖y − x‖²₂ + λ‖z‖₁ + νᵀ(Dx − z)
The dual objective function is obtained in the following way:

inf_{x,z} L(x, z, ν) = −½νᵀDDᵀν + yᵀDᵀν
for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

min ½νᵀDDᵀν − yᵀDᵀν
u.c. −λ1 ≤ ν ≤ λ1

with the primal solution recovered as x⋆ = y − Dᵀν⋆.
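As an illustration, this box-constrained dual QP can be handed to a generic bound-constrained optimizer; the sketch below does so for the L1-T filter, whose D is the second-difference operator (the solver choice and names are assumptions, not the thesis's implementation):

```python
import numpy as np
from scipy.optimize import minimize

def l1_trend_filter(y, lam):
    """Solve the dual QP  min (1/2) nu^T D D^T nu - y^T D^T nu  subject to
    -lam <= nu <= lam, then recover the filtered signal x = y - D^T nu."""
    n = len(y)
    D = np.zeros((n - 2, n))                    # second-difference operator
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    DDt, Dy = D @ D.T, D @ y
    res = minimize(lambda nu: 0.5 * nu @ DDt @ nu - Dy @ nu,
                   np.zeros(n - 2),
                   jac=lambda nu: DDt @ nu - Dy,
                   method="L-BFGS-B",
                   bounds=[(-lam, lam)] * (n - 2))
    return y - D.T @ res.x
```

For a signal that is already affine, Dy = 0, the optimal ν is 0 and the filter returns the input unchanged.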
The L1-C filter
The optimization procedure for the L1-C filter follows the same strategy as for the L1-T filter. We obtain the same quadratic program with the operator D replaced by the (n − 1) × n matrix which is the discrete version of the first-order derivative, i.e. (Dx)ₜ = xₜ₊₁ − xₜ:

D = ⎡ −1   1              ⎤
    ⎢      −1   1         ⎥
    ⎢           ⋱    ⋱    ⎥
    ⎣               −1   1 ⎦
The L1-TC filter
In order to follow the same strategy presented above, we introduce two additional variables z₁ = D₁x and z₂ = D₂x. The initial problem becomes:

min ½‖y − x‖²₂ + λ₁‖z₁‖₁ + λ₂‖z₂‖₁
u.c. z₁ = D₁x, z₂ = D₂x
The Lagrangian function with the dual variables ν₁ ∈ ℝⁿ⁻¹ and ν₂ ∈ ℝⁿ⁻² is:

L(x, z₁, z₂, ν₁, ν₂) = ½‖y − x‖²₂ + λ₁‖z₁‖₁ + λ₂‖z₂‖₁ + ν₁ᵀ(D₁x − z₁) + ν₂ᵀ(D₂x − z₂)
whereas the dual objective function is:

inf_{x,z₁,z₂} L(x, z₁, z₂, ν₁, ν₂) = −½‖D₁ᵀν₁ + D₂ᵀν₂‖²₂ + yᵀ(D₁ᵀν₁ + D₂ᵀν₂)
Let us define ȳ = (ȳₜ) with ȳₜ = m⁻¹ Σ_{i=1}^{m} yₜ⁽ⁱ⁾. The dual objective function becomes:

inf_{x,z} L(x, z, ν) = −½νᵀDDᵀν + ȳᵀDᵀν + ½ Σ_{i=1}^{m} (y⁽ⁱ⁾ − ȳ)ᵀ(y⁽ⁱ⁾ − ȳ)
for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

min ½νᵀDDᵀν − ȳᵀDᵀν
u.c. −λ1 ≤ ν ≤ λ1

min f₀(x)
u.c. Ax = b, fᵢ(x) < 0 for i = 1, …, m
The solution of rτ(x, λ, ν) = 0 can be obtained by Newton's iteration for the triple y = (x, λ, ν):

rτ(y + Δy) ≈ rτ(y) + ∇rτ(y) Δy = 0

This equation gives the Newton step Δy = −∇rτ(y)⁻¹ rτ(y), which defines the search direction.
For the L2 filter, we know that the solution is x̂ᴴᴾ = (1 + 2λDᵀD)⁻¹y. Therefore, the spectral density is:

fᴴᴾ(ω) = (1 / (1 + 4λ(3 − 4 cos ω + cos 2ω)))² ≈ (1 / (1 + 2λω⁴))²
The width of the spectral density for the L2 filter is then (2λ)^(−1/4), whereas it is 2πT⁻¹ for the moving-average filter. Calibrating the L2 filter can be done by matching these two quantities. Finally, we obtain the following relationship:

λ ∝ λ⋆ = ½ (T / 2π)⁴
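This calibration rule is a one-liner (the function name is illustrative):

```python
from math import pi

def l2_lambda_star(T):
    """Calibrate the L2 filter by matching its spectral width
    (2*lam)**(-1/4) to the width 2*pi/T of a T-period moving-average
    filter, giving lam* = (1/2) * (T / (2*pi))**4."""
    return 0.5 * (T / (2 * pi)) ** 4
```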
In Figure A.1, we represent the spectral density of the moving-average filter for different windows T. We also report the spectral density of the corresponding L2 filters; for that, we have calibrated the optimal parameter λ⋆ by least-squares minimization. In Figure A.2, we compare the optimal estimator λ⋆ with the one corresponding to 10.27 × λ⋆. We notice that the approximation is very good.
Figure A.2: Relationship between the value of λ and the length of the moving-average
filter
Appendix B
Appendix of chapter 2
The conditional expectation with respect to the couple (σᵤ, µᵤ), which are supposed to be independent of dWᵤ, is given by:

E(R²_{tᵢ} | σ, µ) = ∫_{tᵢ₋₁}^{tᵢ} σᵤ² du + ( ∫_{tᵢ₋₁}^{tᵢ} (µᵤ − ½σᵤ²) du )²

var(R²_{tᵢ} | σ, µ) = 2 ( ∫_{tᵢ₋₁}^{tᵢ} σᵤ² du )² + 4 ( ∫_{tᵢ₋₁}^{tᵢ} σᵤ² du ) ( ∫_{tᵢ₋₁}^{tᵢ} (µᵤ − ½σᵤ²) du )²   (B.1)
We remark that when the time step (tᵢ − tᵢ₋₁) becomes small, the estimator becomes unbiased, with standard deviation √2 (tᵢ − tᵢ₋₁) σ²_{tᵢ₋₁}. This error is directly proportional to the quantity being estimated.
We observe that this estimator is weakly biased; however, this effect is totally negligible: if we consider a volatility of 20% with a trend of 10%, the estimated volatility is 20.006% instead of 20%.
The variance of the canonical estimator (the estimation error) reads:

Σ_{i=1}^{n} [ 2 ( ∫_{tᵢ₋₁}^{tᵢ} σᵤ² du )² + 4 ( ∫_{tᵢ₋₁}^{tᵢ} σᵤ² du ) ( ∫_{tᵢ₋₁}^{tᵢ} (µᵤ − ½σᵤ²) du )² ]
If the recorded times tᵢ are regularly distributed with time-spacing Δt, then we have:

var( Σ_{i=1}^{n} R²_{tᵢ} | σ, µ ) ≈ 2σ⁴ (tₙ − t₀) Δt
Appendix C
Appendix of chapter 3
In order to get the dual problem, we construct the Lagrangian for the inequality constraints by introducing positive Lagrange multipliers Λ = (α₁, …, αₙ) ≥ 0:

L(w, b, Λ) = ½‖w‖² − Σ_{i=1}^{n} αᵢ yᵢ (wᵀxᵢ + b) + Σ_{i=1}^{n} αᵢ
Minimizing the Lagrangian with respect to (w, b), we obtain the following equations:

∂L/∂w = w − Σ_{i=1}^{n} αᵢ yᵢ xᵢ = 0
∂L/∂b = −Σ_{i=1}^{n} αᵢ yᵢ = 0
Inserting these results into the Lagrangian, we obtain the dual objective function L_D:

L_D(Λ) = Λᵀ1 − ½ΛᵀDΛ
max_Λ  Λᵀ1 − ½ΛᵀDΛ
u.c. Λᵀy = 0, Λ ≥ 0
We turn now to the soft-margin SVM classifier with the L1 penalty, i.e. F(u) = u and p = 1. We first write down the primal problem:

min_{w,b,ξ}  ½‖w‖² + C · F( Σ_{i=1}^{n} ξᵢᵖ )
u.c. yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0, i = 1 … n
For both cases, we construct the Lagrangian by introducing the couple of Lagrange multipliers (Λ, µ) for the 2n constraints:

L(w, b, Λ, µ) = ½‖w‖² + C · F( Σ_{i=1}^{n} ξᵢ ) − Σ_{i=1}^{n} αᵢ (yᵢ(wᵀxᵢ + b) − 1 + ξᵢ) − Σ_{i=1}^{n} µᵢ ξᵢ
∂L/∂w = w − Σ_{i=1}^{n} αᵢ yᵢ xᵢ = 0
∂L/∂b = −Σ_{i=1}^{n} αᵢ yᵢ = 0
∂L/∂ξ = C1 − Λ − µ = 0
with the inequality constraints Λ ≥ 0 and µ ≥ 0. Inserting these results into the Lagrangian leads to the dual problem:

max_Λ  Λᵀ1 − ½ΛᵀDΛ    (C.1)
u.c. Λᵀy = 0, 0 ≤ Λ ≤ C1
with Λ = (αᵢ)ᵢ₌₁…ₙ, Λ′ = (βᵢ)ᵢ₌₁…ₙ and the constraints Λ, Λ′, µ, µ′ ≥ 0 on the Lagrange multipliers. Minimizing the Lagrangian with respect to (w, b, ξ, ξ′) gives us:

∂L/∂w = w − Σ_{i=1}^{n} (αᵢ − βᵢ) φ(xᵢ) = 0
∂L/∂b = Σ_{i=1}^{n} (βᵢ − αᵢ) = 0
∂L/∂ξ = C1 − Λ − µ = 0
∂L/∂ξ′ = C1 − Λ′ − µ′ = 0
Inserting these results into the Lagrangian leads to the dual problem:

max_{Λ,Λ′}  (Λ − Λ′)ᵀy − ε(Λ + Λ′)ᵀ1 − ½(Λ − Λ′)ᵀK(Λ − Λ′)    (C.2)
u.c. (Λ − Λ′)ᵀ1 = 0, 0 ≤ Λ, Λ′ ≤ C1

When ε = 0, the term ε(Λ + Λ′)ᵀ1 in the objective function disappears; we can then reduce the optimization problem by the change of variable (Λ − Λ′) → Λ. The inequality constraint for the new variable reads |Λ| ≤ C1.
The dual problem can be solved by a QP program, which gives the optimal solution Λ⋆. In order to compute b, we use the KKT conditions:

αᵢ (wᵀφ(xᵢ) + b − yᵢ + ε + ξᵢ) = 0
βᵢ (yᵢ − wᵀφ(xᵢ) − b + ε + ξᵢ′) = 0
(C − αᵢ) ξᵢ = 0
(C − βᵢ) ξᵢ′ = 0

We remark that the two last conditions give us ξᵢ = 0 for 0 < αᵢ < C and ξᵢ′ = 0 for 0 < βᵢ < C. This result directly implies the following condition for all support vectors of the training set (xᵢ, yᵢ):

wᵀφ(xᵢ) + b − yᵢ = 0

We denote by SV the set of support vectors. Using the condition w = Σ_{i=1}^{n} (αᵢ − βᵢ) φ(xᵢ) and averaging over the support vectors, we finally obtain:

b = (1/n_SV) Σ_{i∈SV} (yᵢ − zᵢ)

with z = K(Λ − Λ′).
The required condition for this scheme is that the loss function L(y, t) be differentiable. We first study the case of the quadratic loss, where L(y, t) is differentiable, and then the soft-margin case, where we have to regularize L(y, t).
The Newton iteration then consists of updating the vector (b, β)ᵀ until convergence as follows:

(b, β) ← (b, β) − γ H⁻¹ ∇L_P
Published paper in the Lyxor White Paper Series:
Issue #8
WHITE PAPER
TREND FILTERING METHODS FOR MOMENTUM STRATEGIES
Foreword
The widespread endeavor to "identify" trends in market prices has given rise to a significant amount of literature. Elliott Wave Principles, Dow Theory and business cycles, among many others, are common examples of attempts to better understand the nature of market price trends.
Unfortunately, this literature often proves frustrating. In their attempt to discover new rules, many authors eventually lack precision and forget to apply basic research methodology. Results are indeed often presented without any reference either to the necessary hypotheses or to confidence intervals. As a result, it is difficult for investors to find firm guidance there and to differentiate phonies from the real McCoy.
This said, attempts to differentiate meaningful information from exogenous noise lie at the core of modern Statistics and Time Series Analysis. Time Series Analysis pursues goals similar to those of the above-mentioned approaches, but in a manner which can be tested. Today more than ever, modern computing capacities allow anybody to implement quite powerful tools and to independently tackle trend estimation issues. The primary aim of this 8th White Paper is to act as a comprehensive and simple handbook to the most widespread trend measurement techniques.
Even equipped with refined measurement tools, investors still have to remain wary about their representation of trends. Trends are sometimes thought of as some hidden force pushing markets up or down. In this deterministic view, trends should persist.
However, random walks also generate trends! Five reds drawn in a row from an unbiased roulette wheel do not give any clue about the next color drawn. It is just a past trend with nothing to do with any underlying structure: a mere succession of independent events. And the bottom line is that neither of these two hypotheses can be confirmed or dismissed with certainty.
As a consequence, overfitting issues constitute one of the most serious pitfalls in applying trend filtering techniques in finance. Designing effective calibration procedures proves to be as important as the theoretical knowledge of trend measurement theories. The practical use of trend extraction techniques for investment purposes constitutes the other topic addressed in this 8th White Paper.
Nicolas Gaussel
Global Head of Quantitative Asset Management
Quant Research by Lyxor
Executive Summary
Introduction
The efficient market hypothesis implies that all available information is reflected in current
prices, and thus that future returns are unpredictable. Nevertheless, this assumption has
been rejected in a large number of academic studies. It is commonly accepted that financial
assets may exhibit trends or cycles. Some studies cite slow-moving economic variables related
to the business cycle as an explanation for these trends. Other research argues that investors
are not fully rational, meaning that prices may underreact in the short run and overreact at
long horizons.
Momentum strategies try to benefit from these trends. There are two opposing types:
trend following and contrarian. Trend following strategies are momentum strategies in which
an asset is purchased if the price is rising, while in the contrarian strategy assets are sold
if the price is falling. The first step in both strategies is trend estimation, which is the
focus of this paper. After a review of trend filtering techniques, we address practical issues,
depending on whether trend detection is designed to explain the past or forecast the future.
The simplest trend filtering method is the moving average filter. On average, the noisy parts of the observations tend to cancel each other out, while the trend has a cumulative nature. But observations can be averaged using many different types of weightings. More generally, the different averages obtained are referred to as linear filters. Several examples of trend filtering for various linear filters are shown in Figure 1. In this example, the averaging horizon (65 business days or one year) has much more influence than the type of averaging.
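The uniform moving average, the simplest of these linear filters, can be sketched as:

```python
import numpy as np

def moving_average_trend(prices, window=65):
    """Uniform moving-average trend filter: each trend value is the average
    of `window` consecutive observations, so the noise tends to cancel out
    while the trend accumulates."""
    weights = np.ones(window) / window
    return np.convolve(prices, weights, mode="valid")
```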
Other trend following methods, which are classified as nonlinear, use more complex
calculations to obtain more specific results (such as filters based on wavelet analysis, support
vector machines or singular spectrum analysis). For instance, the L1 filter is designed to
obtain piecewise constant trends, which can be interpreted more easily.
Figure 1: Trend estimate of the S&P 500 index
Figure 3 illustrates that the distributions of the one-month GSCI index returns after
a very positive three-month trend (i.e. above a threshold) clearly dominate the return
distribution after a very negative trend (i.e. below the threshold).
Furthermore, this persistence effect is also tested in Table 1 for a number of major financial indices. This table compares the average one-month return following a positive three-month trend period to the average one-month return following a negative three-month trend period.
On average, for all indices under consideration, returns are higher after a positive trend than
after a negative one. Thus, the trends are persistent, and seem to have a predictive value.
This makes the case for the study of trend following strategies, and highlights the appeal of
trend filtering methods.
Conclusion
The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit
from the identified trends. Such strategies must rely on appropriate trend estimators and
time horizons. This paper highlights the variety of estimators available in the academic
literature. But the choice of trend estimator is just one of the many questions that arise in the definition of those strategies. In particular, diversification and risk budgeting are key
aspects of success.
Table of Contents
1 Introduction
4 Conclusion
A Statistical complements
A.1 State space model and Kalman filtering
A.2 L1 filtering
A.3 Wavelet analysis
A.4 Support vector machine
A.5 Singular spectrum analysis
Abstract
This paper studies trend filtering methods. These methods are widely used in mo-
mentum strategies, which correspond to an investment style based only on the history
of past prices. For example, the CTA strategy used by hedge funds is one of the
best-known momentum strategies. In this paper, we review the different econometric
estimators to extract a trend of a time series. We distinguish between linear and non-
linear models as well as univariate and multivariate filtering. For each approach, we
provide a comprehensive presentation, an overview of its advantages and disadvantages
and an application to the S&P 500 index. We also consider the calibration problem of
these filters. We illustrate the two main solutions, the first based on prediction error,
and the second using a benchmark estimator. We conclude the paper by listing some
issues to consider when implementing a momentum strategy.
Keywords: Momentum strategy, trend following, moving average, filtering, trend extrac-
tion.
JEL classification: G11, G17, C63.
1 Introduction
The efficient market hypothesis tells us that financial asset prices fully reflect all available
information (Fama, 1970). One consequence of this theory is that future returns are not
predictable. Nevertheless, since the beginning of the nineties, a large body of academic
research has rejected this assumption. One of the arguments is that risk premiums are time
varying and depend on the business cycle (Cochrane, 2001). In this framework, returns
on financial assets are related to some slow-moving economic variables that exhibit cyclical
patterns in accordance with the business cycle. Another argument is that some agents are
not fully rational, meaning that prices may underreact in the short run but overreact at long
horizons (Hong and Stein, 1997). This phenomenon may be easily explained by the theory
of behavioural finance (Barberis and Thaler, 2002).
∗ We are grateful to Guillaume Jamet and Hoang-Phong Nguyen for their helpful comments.
Based on these two arguments, it is now commonly accepted that prices may exhibit
trends or cycles. In some sense, these arguments chime with the Dow theory (Brown et al.,
1998), which is one of the first momentum strategies. A momentum strategy is an investment
style based only on the history of past prices (Chan et al., 1996). We generally distinguish
between two types of momentum strategy:
1. the trend following strategy, which consists of buying (or selling) an asset if the esti-
mated price trend is positive (or negative);
2. the contrarian (or mean-reverting) strategy, which consists of selling (or buying) an
asset if the estimated price trend is positive (or negative).
Contrarian strategies are clearly the opposite of trend following strategies. One of the tasks
involved in these strategies is to estimate the trend, except when they are based on mean-reverting
processes (see D'Aspremont, 2011). In this paper, we provide a survey of the different
trend filtering methods. However, trend filtering is just one of the difficulties in building a
momentum strategy. The complete process of constructing a momentum strategy is highly
complex, especially as regards transforming past trends into exposures – an important factor
that is beyond the scope of this paper.
The paper is organized as follows. Section two presents a survey of the different econo-
metric trend estimators. In particular, we distinguish between methods based on linear
filtering and nonlinear filtering. In section three, we consider some issues that arise when
trend filtering is applied in practice. We also propose some methods for calibrating trend
filtering models and highlight the problem of estimator variance. Section four offers some
concluding remarks.
yt = xt + εt
where xt represents the trend and εt is a stochastic (or noise) process. There is no precise
definition for trend, but it is generally accepted to be a smooth function representing long-
term movements:
“[...] the essential idea of trend is that it shall be smooth.” (Kendall, 1973).
It means that changes in the trend xt must be smaller than those of the process yt . From a
statistical standpoint, it implies that the volatility of yt − yt−1 is higher than the volatility
of xt − xt−1 :
σ (yt − yt−1) ≫ σ (xt − xt−1)
One of the major problems in financial econometrics is the estimation of xt . This is the
subject of signal extraction and filtering (Pollock, 2009).
Finite moving average filtering for trend estimation has a long history. It has been used
in actuarial science since the beginning of the twentieth century2 . But the modern theory of
signal filtering has its origins in the Second World War and was formulated independently
by Norbert Wiener (1941) and Andrei Kolmogorov (1941) in two different ways. Wiener
worked principally in the frequency domain whereas Kolmogorov considered a time-domain
approach. This theory was extensively developed in the fifties and sixties by mathematicians
and statisticians such as Hermann Wold, Peter Whittle, Rudolf Kalman, Maurice Priestley,
George Box, etc. In economics, the problem of trend filtering is not a recent one, and may
date back to the seminal article of Muth (1960). It was extensively studied in the eighties and
nineties in the literature on business cycles, which led to a vast body of empirical research
being carried out in this area3 . However, it is in climatology that trend filtering is most
extensively studied nowadays. Another important point is that the development of filtering
techniques has evolved according to the development of computational power and the IT
industry. The Savitzky-Golay smoothing procedure may appear very basic today though it
was revolutionary4 when it was published in 1964.
In what follows, we review the class of filtering techniques that is generally used to
estimate a trend. Moving average filters play an important role in finance. As they are very
intuitive and easy to implement, they undoubtedly represent the model most commonly used
in trading strategies. The moving average technique belongs to the class of linear filters,
which share a lot of common properties. After studying this class of filters, we consider
some nonlinear filtering techniques, which may be well suited to solving financial problems.
unobservable process. A filtering procedure consists of applying a filter L to the data y:
x̂ = L (y)
with x̂ = {. . . , x̂−2 , x̂−1 , x̂0 , x̂1 , x̂2 , . . .}. When the filter is linear, we have x̂ = Ly with the
normalisation condition 1 = L1. If we assume that the signal yt is observed at regular
dates5 , we obtain:
x̂t = Σ_{i=−∞}^{+∞} Lt,t−i yt−i    (1)
We deduce that linear filtering may be viewed as a convolution. The previous filter may not
be of much use, however, because it uses future values of yt . As a result, we generally impose
some restriction on the coefficients Lt,t−i in order to use only past and present values of the
signal. In this case, we say that the filter is causal. Moreover, if we restrict our study to
time invariant filters, the equation (1) becomes a simple convolution of the observed signal
yt with a window function Li :
x̂t = Σ_{i=0}^{n−1} Li yt−i    (2)
With this notation, a linear filter is characterised by a window kernel Li and its support.
The kernel defines the type of filtering, whereas the support defines the range of the filter.
For instance, if we take a square window on a compact support [0, T ] with T = nΔ the
width of the averaging window, we obtain the well-known moving average filter:
Li = (1/n) 1 {i < n}
We finish this description by considering the lag representation, with L the lag operator:
x̂t = ( Σ_{i=0}^{n−1} Li L^i ) yt
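As an illustration, equation (2) with the square window can be coded directly. The following is a minimal sketch (the naming is ours, not the paper's) of the causal uniform moving average:

```python
import numpy as np

def moving_average_filter(y, n):
    """Causal uniform moving average: x_hat[t] = (1/n) * sum_{i=0}^{n-1} y[t-i].

    The first n-1 points are left as NaN since the window is incomplete there.
    Illustrative sketch only.
    """
    y = np.asarray(y, dtype=float)
    x_hat = np.full_like(y, np.nan)
    for t in range(n - 1, len(y)):
        # average of y[t-n+1], ..., y[t]
        x_hat[t] = y[t - n + 1 : t + 1].mean()
    return x_hat
```

For a signal observed at regular dates, this is exactly the convolution of y with the square window kernel.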
dSt / St = μt dt + σt dWt
where μt is the drift, σt is the volatility and Wt is a standard Brownian motion. The
asset price St is observed in a series of discrete dates {t0 , . . . , tn }. Within this model, the
appropriate signal to be filtered is the logarithm of the price yt = ln St but not the price
itself. Let Rt = ln St − ln St−1 represent the realised return at time t over a unit period. If
μt and σt are known, we have:
Rt = (μt − σt²/2) Δ + σt √Δ ηt
5 We have ti+1 − ti = Δ.
where ηt is a standard Gaussian white noise. The filtered trend can be extracted using the
following equation:
x̂t = Σ_{i=0}^{n−1} Li yt−i
and the estimator6 of μt is:
μ̂t ≃ (1/Δ) Σ_{i=0}^{n−1} Li Rt−i
We can also obtain the same result by applying the filter directly to the signal, defining the
derivative of the window function as ℓi = L̇i:
μ̂t ≃ (1/Δ) Σ_{i=0}^{n} ℓi yt−i
Remark 1 In some senses, μ̂t and x̂t are related by the following expression:
μ̂t = d x̂t / dt
Econometric methods principally involve x̂t , whereas μ̂t is more important for trading strate-
gies.
Remark 2 μ̂t is a biased estimator of μt and the bias increases with the volatility of the
process σt. The expression of the unbiased estimator is then:
μ̂t = (1/2) σt² + (1/Δ) Σ_{i=0}^{n−1} Li Rt−i
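The biased and unbiased drift estimators of Remark 2 can be sketched as follows, assuming daily data with Δ = 1/252; the function name and arguments are ours, chosen for illustration:

```python
import numpy as np

def drift_estimate(prices, n, dt=1 / 252, sigma2=None):
    """Uniform-window drift estimator from log-returns (Remark 2 sketch).

    mu_hat = sigma^2 / 2 + (1/dt) * mean of the last n log-returns.
    If sigma2 is None the Ito bias correction is skipped (biased estimator).
    """
    y = np.log(np.asarray(prices, dtype=float))
    r = np.diff(y)[-n:]                 # last n realised returns R_t
    mu_hat = r.mean() / dt              # (1/dt) * sum L_i R_{t-i} with L_i = 1/n
    if sigma2 is not None:
        mu_hat += 0.5 * sigma2          # removes the -sigma^2/2 bias
    return mu_hat
```

On a deterministic exponential price path the uncorrected estimator recovers the drift exactly, which is a quick sanity check of the scaling by Δ.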
Remark 3 In the previous analysis, x̂t and μ̂t are two estimators. We may also represent
them by their corresponding probability density functions. It is therefore easy to derive
estimates, but we should not forget that these estimators present some variance. In finance,
and in particular in trading strategies, the question of statistical inference is generally not
addressed. However, it is a crucial factor in designing a successful momentum strategy.
x̂t = yt. For T > 0, if we assume that the noise εt is independent from xt and is a centered
process, the first contribution of the filtered signal is the average trend:
x̂t = (1/n) Σ_{i=0}^{n−1} xt−i
The above moving average filter can be applied directly to the signal. In this case, μ̂t is
simply the cumulative return over the window period divided by its length: it needs only the
first and last dates of the period under consideration.
Moving average crossovers Many practitioners, and even individual investors, use the
moving average of the price itself as a trend indication, instead of the moving average of
returns. These moving averages are generally uniform moving averages of the price. Here
we will consider an average of the logarithm of the price, in order to be consistent with the
previous examples:
ŷt(n) = (1/n) Σ_{i=0}^{n−1} yt−i
Of course, an average price does not estimate the trend μt . This trend is estimated from
the difference between two moving averages over two different time horizons n1 and n2 .
Supposing that n1 > n2, the trend μ may be estimated from:
μ̂t ≃ (2 / ((n1 − n2) Δ)) (ŷt(n2) − ŷt(n1))    (4)
In particular, the estimated trend is positive if the short-term moving average is higher
than the long-term moving average. Thus, the sign of the trend changes when the short-term
moving average crosses the long-term moving average. Of course, when the short-term
horizon n2 is one, the short-term moving average is just the current asset price. The
scaling term 2 (n1 − n2)⁻¹ Δ⁻¹ is explained below. It is derived from the interpretation of this
estimator as a weighted moving average of asset returns. Indeed, this estimator can be
interpreted in terms of asset returns by inverting the formula (3), with Li being interpreted
as the primitive of ℓi:
Li = 0 if i = 0
Li = ℓi + Li−1 if i = 1, . . . , n − 1
Li = ℓn + Ln−1 if i = n
7 δi,j is equal to 1 if i = j and 0 otherwise.
The weighting of each return in the estimator (4) is represented in Figure 1. It forms a
triangle, and the biggest weighting is given at the horizon of the smallest moving average.
Therefore, depending on the horizon n2 of the shortest moving average, the indicator can
be focused toward the current trend (if n2 is small) or toward past trends (if n2 is as large
as n1 /2 for instance). From these weightings, in the case of a constant trend μ, we can
compute the expectation of the difference between the two moving averages:
E [ŷt(n2) − ŷt(n1)] = ((n1 − n2)/2) (μ − (1/2) σt²) Δ
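A minimal sketch of the crossover estimator (4), assuming y contains log-prices and Δ = 1. On an exactly linear log-price path it recovers the slope, which is one way to check the scaling constant:

```python
import numpy as np

def crossover_trend(y, n1, n2, dt=1.0):
    """Moving-average crossover estimator of the trend (equation (4) sketch):
    mu_hat = 2 / ((n1 - n2) * dt) * (MA_{n2}(y) - MA_{n1}(y)), with n1 > n2
    and y the log-prices. Names are illustrative, not the paper's."""
    y = np.asarray(y, dtype=float)
    ma_short = y[-n2:].mean()      # short-horizon average of log-prices
    ma_long = y[-n1:].mean()       # long-horizon average
    return 2.0 * (ma_short - ma_long) / ((n1 - n2) * dt)
```

For y_t = b·t, the expected gap between the two averages is b (n1 − n2)/2, so the factor 2/((n1 − n2) Δ) turns the gap into an unbiased slope estimate.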
Enhanced filters To improve the uniform moving average estimator, we may take the
following kernel function:
ℓi = (4/n²) sgn (n/2 − i)
We notice that the estimator μ̂t now takes into account all the dates of the window period.
By taking the primitive of the function i , the trend filter is given as follows:
Li = (4/n²) (n/2 − |i − n/2|)
We now move to the second type of moving average filter which is characterised by an
asymmetric form of the convolution kernel. One possibility is to take an asymmetric window
function with a triangular form:
Li = (2/n²) (n − i) 1 {i < n}
By computing the derivative of this window function, we obtain the following kernel:
ℓi = (2/n) (δi,0 − (1/n) 1 {i < n})
The filtering equation of μt then becomes:
μ̂t = (2/n) ( xt − (1/n) Σ_{i=0}^{n−1} xt−i )
Remark 4 Another way to define μ̂t is to consider the Lanczos generalised derivative
(Groetsch, 1998). Let f (x) be a function. We define the Lanczos derivative of f (x) in
terms of the following relationship:
dL f (x) / dx = lim_{ε→0} (3 / (2ε³)) ∫_{−ε}^{+ε} t f (x + t) dt
We first notice that the Lanczos derivative is more general than the traditional derivative.
Although Lanczos’ formula is a more onerous method for finding the derivative, it offers
some advantages. This technique allows us to compute a “pseudo-derivative” at points where
the function is not differentiable. For the observable signal yt , the traditional derivative does
not exist because of the noise εt , but does in the case of the Lanczos derivative. Let us apply
the Lanczos’ formula to estimate the derivative of the trend at the point t − T /2. We obtain:
(dL/dt) x̂t = (12/n³) Σ_{i=0}^{n} (n/2 − i) yt−i
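The discrete Lanczos estimator above can be sketched as follows; the names are ours, and the window is applied to the last n + 1 observations:

```python
import numpy as np

def lanczos_trend_derivative(y, n):
    """Discrete Lanczos derivative sketch: (12/n^3) * sum_{i=0}^{n} (n/2 - i) y[t-i].
    It estimates the slope of the trend around the middle of the window, t - T/2."""
    y = np.asarray(y, dtype=float)
    i = np.arange(n + 1)
    w = (12.0 / n**3) * (n / 2.0 - i)   # antisymmetric weights summing to zero
    window = y[-(n + 1):][::-1]         # y[t-i] for i = 0, ..., n
    return float(np.dot(w, window))
```

On a linear signal y_t = b·t the estimator returns b (n + 1)(n + 2)/n², i.e. the true slope up to a finite-sample factor that vanishes as n grows.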
Table 1: Correlation between the uniform and Lanczos derivatives
n 5 10 22 65 130 260
Pearson ρ 84.67 87.86 90.14 90.52 92.57 94.03
Kendall τ 65.69 68.92 70.94 71.63 73.63 76.17
Spearman 83.15 86.09 88.17 88.92 90.18 92.19
However, this problem is not well-defined. We also need to impose some restrictions on the
underlying process yt or on the filtered trend x̂t to obtain a solution. For example, we may
consider a deterministic constant trend:
xt = xt−1 + μ
If we consider a trend that is not constant, we may define the following objective function:
(1/2) Σ_{t=1}^{n} (yt − x̂t)² + λ Σ_{t=2}^{n−1} (x̂t−1 − 2x̂t + x̂t+1)²
In this function, λ is the regularisation parameter which controls the competition between
the smoothness8 of x̂t and the noise yt − x̂t . We may rewrite the objective function in the
vectorial form:
(1/2) ||y − x̂||₂² + λ ||Dx̂||₂²
where y = (y1 , . . . , yn ), x̂ = (x̂1 , . . . , x̂n ) and the D operator is the (n − 2) × n matrix:
D =
⎡ 1 −2  1            ⎤
⎢    1 −2  1         ⎥
⎢          ⋱         ⎥
⎣            1 −2  1 ⎦
It is known as the Hodrick-Prescott filter (or L2 filter). This filter plays an important role
in the analysis of business cycles.
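A minimal implementation of the L2 (Hodrick-Prescott) filter, building D explicitly and solving the closed-form normal equations; for long series a sparse or banded solver would be preferable:

```python
import numpy as np

def hp_filter(y, lam):
    """Hodrick-Prescott (L2) trend sketch: x_hat = argmin 1/2 ||y - x||^2 + lam ||D x||^2,
    solved in closed form as x_hat = (I + 2*lam*D'D)^{-1} y, where D is the
    (n-2) x n second-difference operator."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t : t + 3] = [1.0, -2.0, 1.0]   # second difference x[t] - 2x[t+1] + x[t+2]
    return np.linalg.solve(np.eye(n) + 2.0 * lam * D.T @ D, y)
```

A linear signal satisfies Dy = 0, so it passes through untouched for any λ, while noisy signals have their second differences shrunk toward zero.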
Kalman filtering Another important trend estimation technique is the Kalman filter,
which is described in Appendix A.1. In this case, the trend μt is a hidden process which
follows a given dynamic. For example, we may assume that the model is9 :
Rt = μt + σζ ζt
μt = μt−1 + ση ηt    (6)
Here, the equation of Rt is the measurement equation and Rt is the observable signal of
realised returns. The hidden process μt is supposed to follow a random walk. We define
μ̂t|t−1 = Et−1 [μt] and Pt|t−1 = Et−1 [(μ̂t|t−1 − μt)²]. Using the results given in Appendix
A.1, we have:
μ̂t+1|t = (1 − Kt ) μ̂t|t−1 + Kt Rt
where Kt = Pt|t−1 / (Pt|t−1 + σζ²) is the Kalman gain. The estimation error is determined
by Riccati’s equation:
Pt+1|t = Pt|t−1 + ση² − Pt|t−1 Kt
Riccati's equation gives us the stationary solution:
P∗ = (ση/2) (ση + √(ση² + 4σζ²))
The filter equation becomes:
μ̂t+1|t = (1 − κ) μ̂t|t−1 + κ Rt
with:
κ = 2ση / (ση + √(ση² + 4σζ²))
This Kalman filter can be considered as an exponential moving average filter with parameter10
λ = − ln (1 − κ):
μ̂t = (1 − e^{−λ}) Σ_{i=0}^{∞} e^{−λi} Rt−i
with11 μ̂t = Et [μt]. The filter of the trend x̂t is therefore determined by the following
equation:
x̂t = (1 − e^{−λ}) Σ_{i=0}^{∞} e^{−λi} yt−i
while the derivative of the trend may be directly related to the observed signal yt as follows:
μ̂t = (1 − e^{−λ}) yt − (1 − e^{−λ}) (e^{λ} − 1) Σ_{i=1}^{∞} e^{−λi} yt−i
In Figure 5, we report the window function of the Kalman filter for several values of λ.
We notice that the cumulative weightings increase strongly with λ. The half-life of this filter
is approximately equal to (λ⁻¹ − 2⁻¹) ln 2. For example, the half-life for λ = 5% is 14
days.
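The stationary-gain recursion can be sketched as the equivalent exponential moving average; function and variable names are ours, chosen for illustration:

```python
import numpy as np

def kalman_ewma_trend(returns, sigma_eta, sigma_zeta):
    """Steady-state Kalman filter of model (6), run as the equivalent
    exponential moving average mu_hat[t] = (1 - kappa) mu_hat[t-1] + kappa R[t],
    with kappa = 2*sigma_eta / (sigma_eta + sqrt(sigma_eta^2 + 4*sigma_zeta^2)).
    Sketch under the paper's stationary-gain result."""
    kappa = 2.0 * sigma_eta / (sigma_eta + np.sqrt(sigma_eta**2 + 4.0 * sigma_zeta**2))
    mu, out = 0.0, []
    for r in np.asarray(returns, dtype=float):
        mu = (1.0 - kappa) * mu + kappa * r   # exponential smoothing of returns
        out.append(mu)
    return np.array(out), kappa
```

The noisier the trend innovation ση relative to the observation noise σζ, the larger κ and the shorter the effective memory of the filter.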
We may wonder what the link is between the regression model (5) and the Markov model
(6). Equation (5) is equivalent to the following state space model12 :
yt = xt + σε εt
xt = xt−1 + μ
This model is called the local level model. We may also assume that the slope of the trend
is stochastic, in which case we obtain the local linear trend model:
yt = xt + σε εt
xt = xt−1 + μt−1 + σζ ζt
μt = μt−1 + ση ηt
These three models are special cases of structural models (Harvey, 1989) and may be easily
solved by Kalman filtering. We also deduce that the Markov model (6) is a special case of
the latter when σε = 0.
Remark 5 We have shown that Kalman filtering may be viewed as an exponential moving
average filter when we consider the Markov model (6). Nevertheless, we cannot regard the
Kalman filter simply as a moving average filter. First, the Kalman filter is the optimal
filter in the case of the linear Gaussian model described in Appendix A.1. Second, it could
be regarded as “an efficient computational solution of the least squares method” (Sorensen,
1970). Third, we could use it to solve more sophisticated processes than the Markov model
(6). However, some nonlinear or non Gaussian models may be too complex for Kalman
filtering. These nonlinear models can be solved by particle filters or sequential Monte Carlo
methods (see Doucet et al., 1998).
Figure 6: Kalman filtered and smoothed components
yt = f (t) + εt
   = β0 (τ) + Σ_{j=1}^{p} βj (τ) (τ − t)^j + εt
For a given value of τ, we estimate the parameters β̂j (τ) using weighted least squares with
the following weightings:
wt = K ((τ − t) / h)
where K is the kernel function with a bandwidth h. We deduce that the filtered trend at
time t is x̂t = β̂0 (t).
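A sketch of the local polynomial regression described above, with a Gaussian kernel; the estimate of the trend at time τ is taken as the fitted intercept β̂0 (names and defaults are ours):

```python
import numpy as np

def local_poly_trend(y, tau, p=1, h=10.0):
    """Kernel-weighted local polynomial fit around time tau (sketch).
    Fits y_t ~ sum_j beta_j (tau - t)^j with Gaussian weights K((tau - t)/h)
    and returns beta_0(tau), the filtered level at tau."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    u = tau - t
    w = np.exp(-0.5 * (u / h) ** 2)             # Gaussian kernel weights
    X = np.vander(u, p + 1, increasing=True)    # columns: 1, u, u^2, ...
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
    return beta[0]
```

Because the fit is exact on linear data for p = 1, the estimator reproduces a deterministic linear trend whatever the bandwidth.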
2.3.2 L1 filtering
The idea of the Hodrick-Prescott filter can be generalised to a larger class of filters by using
the Lp penalty condition instead of the L2 penalty. This generalisation was previously
discussed in the work of Daubechies et al. (2004) in relation to the linear inverse problem,
while Tibshirani (1996) considers the Lasso regression problem. If we consider an L1 filter,
the objective function becomes:
(1/2) Σ_{t=1}^{n} (yt − x̂t)² + λ Σ_{t=2}^{n−1} |x̂t−1 − 2x̂t + x̂t+1|
15 We have x̂t = S (t) = yt.
16 We have x̂t = S (t) = ĉ + μ̂ t, with (ĉ, μ̂) the OLS estimate of yt on a constant and time t, because the optimum is reached for S′′ (τ) = 0.
17 For the kernel regression, we use a Gaussian kernel with a bandwidth h = 0.10. We notice the impact of the degree of polynomial. The higher the degree, the smoother the trend (and the slope of the trend).
18 For the loess regression, the degree of polynomial is set to 1 and the bandwidth h is 0.02. We show the
We have illustrated the L1 filter in Figure 8. Contrary to all other previous methods, the
filtered signal comprises a set of straight trends and breaks21 , because the L1 norm imposes
the condition that the second derivative of the filtered signal must be zero. The competition
between the two terms in the objective function turns to the competition between the number
of straight trends (or the number of breaks) and the closeness to the data. Thus, the
smoothing parameter λ plays an important role for detecting the number of breaks. This
explains why L1 filtering is radically different from L2 (or Hodrick-Prescott) filtering. Moreover,
it is easy to compute the slope of the trend μ̂t for the L1 filter. It is a step function, indicating
clearly if the trend is up or down, and when it changes (see Figure 8).
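There is no closed-form solution for the L1 filter, so a numerical scheme is needed. The following is a basic ADMM sketch (an implementation choice of ours, not the paper's algorithm):

```python
import numpy as np

def l1_trend_filter(y, lam, rho=1.0, n_iter=500):
    """l1 trend filtering sketch: minimise 1/2 ||y - x||_2^2 + lam * ||D x||_1
    with D the second-difference operator, solved by a plain ADMM loop
    (x-update: linear solve; z-update: soft-thresholding). Illustrative only."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t : t + 3] = [1.0, -2.0, 1.0]
    A = np.eye(n) + rho * D.T @ D        # matrix of the x-update system
    z = np.zeros(n - 2)
    u = np.zeros(n - 2)
    for _ in range(n_iter):
        x = np.linalg.solve(A, y + rho * D.T @ (z - u))
        Dx = D @ x
        v = Dx + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)   # prox of lam*||.||_1
        u = u + Dx - z
    return x
```

The soft-thresholding step is what produces the piecewise-linear output: most second differences are driven exactly to zero, leaving a small number of breaks.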
We denote ŷ (ω) = F (y) the Fourier transform of the signal y. By construction, we have
y = F⁻¹ (ŷ), with F⁻¹ the inverse Fourier transform. A simple idea for denoising in spectral
analysis is to set some coefficients ŷ (ω) to zero before reconstructing the signal. Figure 9
is an illustration of denoising using the
thresholding rule. Selected parts of the frequency spectrum can easily be manipulated by
filtering tools. For example, some can be attenuated, and others may be completely removed.
Applying the inverse Fourier transform to this filtered spectrum leads to a filtered time series.
Therefore, a smoothing signal can be easily performed by applying a low-pass filter, that is,
by removing the higher frequencies. For example, we have represented two denoised signals
of the S&P 500 index in Figure 9. For the first one, we use a 95% thresholding procedure
whereas 99% of the Fourier coefficients are set to zero in the second case. One difficulty
with this approach is the bad time location for low frequency signals and the bad frequency
location for the high frequency signals. It is then difficult to localise when the trend (which
is located in low frequencies) reverses. But the main drawback of spectral analysis is that
it is not well suited to nonstationary processes (Martin and Flandrin, 1985, Fuentes, 2002,
Oppenheim and Schafer, 2009).
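The thresholding rule described above can be sketched with a discrete Fourier transform; the 95% procedure in Figure 9 corresponds to keep = 0.05 in this illustrative function:

```python
import numpy as np

def fourier_denoise(y, keep=0.05):
    """Spectral thresholding sketch: keep only the largest `keep` fraction of
    Fourier coefficients (by modulus), set the rest to zero, then invert.
    This acts as a crude denoising / low-pass filter."""
    y = np.asarray(y, dtype=float)
    coeffs = np.fft.rfft(y)
    k = max(1, int(np.ceil(keep * len(coeffs))))
    threshold = np.sort(np.abs(coeffs))[-k]      # modulus of the k-th largest coeff
    coeffs[np.abs(coeffs) < threshold] = 0.0
    return np.fft.irfft(coeffs, n=len(y))
```

Because a low-frequency trend concentrates its energy in a few Fourier coefficients while white noise spreads evenly across the spectrum, keeping the few largest coefficients removes most of the noise energy.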
A solution consists of adopting a double dimension analysis, both in time and frequency.
This approach corresponds to the wavelet analysis. The method of denoising is the same as
described previously and the estimation of xt is done in three steps:
1. we compute the wavelet transform W of the original signal yt to obtain the wavelet
coefficients ω = W (y);
2. we apply a denoising rule D to the wavelet coefficients: ω★ = D (ω);
3. we convert the modified wavelet coefficients into a new signal using the inverse wavelet
transform W⁻¹: x̂ = W⁻¹ (ω★)
There are two principal choices in this approach. First, we have to specify which mother
wavelet to use. Second, we have to define the denoising rule. Let ω − and ω + be two scalars
with 0 < ω − < ω + . Donoho and Johnstone (1995) define several shrinkage methods22 :
• Hard shrinkage:
ωi★ = ωi · 1 {|ωi| > ω+}
• Soft shrinkage:
ωi★ = sgn (ωi) · (|ωi| − ω+)+
• Semi-soft shrinkage:
ωi★ = 0 if |ωi| ≤ ω−
ωi★ = sgn (ωi) ω+ (|ωi| − ω−) / (ω+ − ω−) if ω− < |ωi| ≤ ω+
ωi★ = ωi if |ωi| > ω+
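The three shrinkage rules can be written directly as vectorised functions (ω− and ω+ become w_minus and w_plus; naming is ours):

```python
import numpy as np

def hard_shrink(w, w_plus):
    """Hard shrinkage: keep coefficients above the threshold, zero the rest."""
    w = np.asarray(w, dtype=float)
    return w * (np.abs(w) > w_plus)

def soft_shrink(w, w_plus):
    """Soft shrinkage: sgn(w) * max(|w| - w_plus, 0)."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - w_plus, 0.0)

def semi_soft_shrink(w, w_minus, w_plus):
    """Semi-soft shrinkage: zero below w_minus, identity above w_plus, and a
    linear ramp sgn(w) * w_plus * (|w| - w_minus) / (w_plus - w_minus) in between."""
    w = np.asarray(w, dtype=float)
    a = np.abs(w)
    mid = np.sign(w) * w_plus * (a - w_minus) / (w_plus - w_minus)
    return np.where(a <= w_minus, 0.0, np.where(a <= w_plus, mid, w))
```

Note that the semi-soft rule is continuous: it equals 0 at |ωi| = ω− and matches the identity at |ωi| = ω+.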
Wavelet filtering is illustrated in Figure 10. We have computed the wavelet coefficients
using the cascade algorithm of Mallat (1989) and the low-pass and high-pass filters of order
6 proposed by Daubechies (1992). The filtered trend is obtained using quantile shrinkage.
In the first case, the noisy signal remains because we consider all the coefficients (q = 0). In
the second and third cases, 95% and 99% of the wavelet coefficients are set to zero23 .
Until now, we have assumed that the trend is specific to a financial asset. However, we may
be interested in estimating the common trend of several financial assets. For example, if we
wanted to estimate the trend of emerging markets equities, we could use a global index like
the MSCI EM or extract the trend by considering several indices, e.g. the Bovespa index
(Brazil), the RTS index (Russia), the Nifty index (India), the HSCEI index (China), etc. In
this case, the trend-cycle model becomes:
(yt(1), . . . , yt(m))′ = xt 1 + (εt(1), . . . , εt(m))′
where yt(j) and εt(j) are respectively the signal and the noise of the financial asset j and xt
is the common trend. One idea for estimating the common trend is to obtain the mean of
the specific trends:
x̂t = (1/m) Σ_{j=1}^{m} x̂t(j)
If we consider moving average filtering, it is equivalent to applying the filter to the average
signal26 ȳt = (1/m) Σ_{j=1}^{m} yt(j). This rule is also valid for some nonlinear filters such as L1 filtering
(see Appendix A.2). In what follows, we consider the two main alternative approaches
developed in econometrics to estimate a (stochastic) common trend.
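The averaging rule and the identity of footnote 26 can be checked numerically: filtering each asset and averaging gives the same result as filtering the average signal, by linearity. A sketch (names are ours):

```python
import numpy as np

def average_of_filtered_trends(Y, L):
    """Common-trend sketch: filter each asset with x_hat[t] = sum_i L[i] y[t-i],
    then average across assets. Y is an (m, T) array of signals, L a length-n
    array of filter weights."""
    m, T = Y.shape
    n = len(L)
    per_asset = np.array(
        [[np.dot(L, Y[j, t - n + 1 : t + 1][::-1]) for t in range(n - 1, T)]
         for j in range(m)]
    )
    return per_asset.mean(axis=0)
```

The test below verifies the footnote 26 identity against filtering the cross-sectional average directly.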
The econometrics of nonstationary time series may also help us to estimate a common trend.
yt(j) is said to be integrated of order 1 if the change yt(j) − yt−1(j) is stationary. We will note
yt(j) ∼ I (1) and (1 − L) yt(j) ∼ I (0). Let us now define yt = (yt(1), . . . , yt(m))′. The vector yt
is cointegrated of rank r if there exists a matrix β of rank r such that zt = β′yt ∼ I (0).
In this case, we show that yt may be specified by an error-correction model (Engle and
Granger, 1987):
Δyt = γzt−1 + Σ_{i=1}^{∞} Φi Δyt−i + ζt    (7)
where ζt is a I (0) vector process. Stock and Watson (1988) propose another interesting
representation of cointegration systems. Let ft be a vector of r common factors which are
I (1). Therefore, we have:
yt = Aft + ηt (8)
where ηt is a I (0) vector process and ft is a I (1) vector process. One of the difficulties with
this type of model is the identification step (Peña and Box, 1987). Gonzalo and Granger
(1995) suggest defining a permanent-transitory (P-T) decomposition:
yt = Pt + Tt
such that the permanent component Pt is difference stationary, the transitory component Tt
is covariance stationary and (ΔPt , Tt ) satisfies a constrained autoregressive representation.
Using this framework and some other conditions, Gonzalo and Granger show that we may
obtain the representation (8) by estimating the relationship (7):
ft = γ̆′yt    (9)
where γ̆′γ = 0. They then follow the works of Johansen (1988, 1991) to derive the maximum
likelihood estimator of γ̆. Once we have estimated the relationship (9), it is also easy to
identify the common trend27 x̂t .
26 We have:
x̂t = (1/m) Σ_{j=1}^{m} Σ_{i=0}^{n−1} Li yt−i(j)
   = Σ_{i=0}^{n−1} Li ( (1/m) Σ_{j=1}^{m} yt−i(j) )
   = Σ_{i=0}^{n−1} Li ȳt−i
Remark 6 The case ση = 0 has been extensively studied by Chang et al. (2009). In
particular, they show that yt is cointegrated with β = Ω⁻¹Γ, with Γ a m × (m − 1) matrix
such that Γ′Ω⁻¹α = 0 and Γ′Ω⁻¹Γ = Im−1. Using the P-T decomposition, they also found
that the common stochastic trend is given by α′Ω⁻¹yt, implying that the above averaging
rule is not optimal.
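A small simulation illustrates the cointegration setting: two I(1) signals sharing a common stochastic trend admit a stationary linear combination; here β = (1, −1) by construction, and the simulated model is ours, chosen purely for illustration:

```python
import numpy as np

# Two prices share one I(1) common factor f_t, so y_t = (y1, y2) is
# cointegrated: z_t = y1 - y2 is I(0), while each y_i is a random walk
# plus stationary noise.
rng = np.random.default_rng(42)
T = 5000
f = np.cumsum(rng.standard_normal(T))     # common stochastic trend, I(1)
y1 = f + rng.standard_normal(T)           # asset 1 = common trend + noise
y2 = f + rng.standard_normal(T)           # asset 2
z = y1 - y2                               # cointegrating combination, stationary
```

The variance of z stays bounded (around 2 here) while the sample variance of each price series grows with the random-walk component.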
We come back to the example given in Figure 6. Using the second set of
parameters, we now consider three stock indices: the S&P 500 index, the Stoxx 600 index
and the MSCI EM index. For each index, we estimate the filtered trend. Moreover, using the
previous common stochastic trend model28 , we estimate the common trend for the bivariate
signal (S&P 500, Stoxx 600) and the trivariate signal (S&P 500, Stoxx 600, MSCI EM).
3 Trend filtering in practice
3.1 The calibration problem
For the practical use of the trend extraction techniques discussed above, the calibration of
filtering parameters is crucial. These calibrated parameters must incorporate our prediction
requirement or they can be mapped to a commonly-known benchmark estimator. These
constraints offer us some criteria for determining the optimal parameters for our expected
prediction horizon. Below, we consider two possible calibration schemes based on these
criteria.
The problem is that the computation of the log-likelihood for the innovation process
vt = yt − Et−h [yt] is trickier because there is generally no analytic expression. This is
why we do not recommend this approach for trend filtering problems, since the estimated
trends are generally very short-term. A better solution is to employ a cross-validation
procedure to calibrate the parameters θ of the filters discussed above. Let us consider the
calibration scheme presented in Figure 13. We divide our historical data into a training set
and a validation set, which are characterised by two time parameters T1 and T2 . The size
of training set T1 controls the precision of our calibration, for a fixed parameter θ. For this
training set, the expectation Et−h [yt] is computed. The second parameter
29 Another way of estimating the parameters is to consider the log-likelihood function in the frequency
domain (Roncalli, 2010). In the case of the local linear trend model, the stationary form of yt is
S (yt) = (1 − L)² yt. We deduce that the associated log-likelihood function is:
ℓ = −(n/2) ln 2π − (1/2) Σ_{j=0}^{n−1} ln f (λj) − (1/2) Σ_{j=0}^{n−1} I (λj) / f (λj)
where I (λj) is the periodogram of S (yt) and f (λ) is the spectral density.
T2 determines the size of the validation set, which is used to estimate the prediction error:
e (θ; h) = Σ_{t=h+1}^{n} (yt − Et−h [yt])²
This quantity is directly related to the prediction horizon h = T2 for a given investment
strategy. The minimisation of the prediction error leads to the optimal value θ★ of the filter
parameters, which is then used to predict the trend for the test set. For example, we apply
this calibration scheme to L1 filtering with h equal to 50 days. Figure 14 illustrates the
calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50. Minimising the
cumulative prediction error over the validation set gives the optimal value λ★ = 7.03.
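The cross-validation scheme can be sketched as follows, using an exponential moving average as a stand-in for the filter being calibrated; the grid search and the constant-level prediction over the validation window are simplifying assumptions of ours:

```python
import numpy as np

def calibrate_ewma(y, T1, T2, lambdas):
    """Cross-validation sketch for the calibration scheme of Figure 13:
    fit an exponential moving average on a training window of size T1, then
    pick the smoothing parameter minimising the squared prediction error
    over the following validation window of size T2 (horizon h = T2)."""
    y = np.asarray(y, dtype=float)
    train, valid = y[:T1], y[T1 : T1 + T2]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        k = 1.0 - np.exp(-lam)
        level = train[0]
        for v in train[1:]:
            level = (1.0 - k) * level + k * v   # EWMA level at end of training set
        err = np.sum((valid - level) ** 2)      # constant-level prediction error
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

When the signal has shifted recently, the procedure selects the more reactive (larger) smoothing parameter, as the test below illustrates.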
Figure 13: Calibration scheme. The historical data are divided into a training set of length
T1 and a validation set of length T2; the trend is then predicted over a horizon T2 beyond
today.
Figure 14: Calibration procedure with the S&P 500 index for the L1 filter
The second calibration scheme maps the filter to a benchmark estimator using spectral analysis. Though the L2 filter provides an explicit solution which is a great
advantage for numerical implementation, the calibration of the smoothing parameter λ is
not straightforward. We propose to calibrate the L2 filter by comparing the spectral density
of this filter with that obtained using the uniform moving average filter with horizon n for
which the spectral density is:
f MA (ω) = (1/n²) | Σ_{t=0}^{n−1} e^{−iωt} |²
For the L2 filter, the solution has the analytical form x̂ = (I + 2λD′D)⁻¹ y. Therefore, the
spectral density can also be computed explicitly:
f HP (ω) = ( 1 / (1 + 4λ (3 − 4 cos ω + cos 2ω)) )²
This spectral density can then be approximated by (1 + 2λω⁴)⁻². Hence, the spectral
width is (2λ)^{−1/4} for the L2 filter whereas it is 2πn⁻¹ for the uniform moving average filter.
The calibration of the L2 filter could be achieved by matching these two quantities. Finally,
we obtain the following relationship:
λ ∝ λ★ = (1/2) (n / 2π)⁴
In Figure 15, we represent the spectral density of the uniform moving average filter for
different window sizes n. We also report the spectral density of the corresponding L2 filters.
To obtain this, we calibrated the optimal parameter λ by least squares minimisation. In
Figure 16, we compare the optimal estimator λ with the value 10.27 × λ★. We notice that
the approximation is very good30.
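The matching formula can be coded in one line; the factor 10.27 quoted above is the paper's least-squares calibration, not something the formula itself produces:

```python
import math

def l2_lambda_from_window(n):
    """Spectral matching sketch: lambda* = (1/2) * (n / (2*pi))**4 maps a
    uniform moving average of length n to an L2 (Hodrick-Prescott) smoothing
    parameter, up to the calibrated proportionality factor."""
    return 0.5 * (n / (2.0 * math.pi)) ** 4
```

The quartic dependence means that doubling the moving-average window multiplies the equivalent smoothing parameter by sixteen.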
It indicates how far the estimates are from the true value. We say that the estimator μ̂t(1)
is more efficient than the estimator μ̂t(2) if its MSE is lower:
μ̂t(1) ⪰ μ̂t(2) ⇔ MSE (μ̂t(1)) ≤ MSE (μ̂t(2))
Figure 16: Relationship between the value of λ and the length of the moving average filter
The first component is the variance of the estimator var (μ̂t), whereas the second component
is the square of the bias B (μ̂t). Generally, we are interested in estimators that are unbiased
(B (μ̂t) = 0). If this is the case, comparing two estimators is equivalent to comparing their
variances.
with sgn(yt−i − yt−j) = 1 if yt−i > yt−j and sgn(yt−i − yt−j) = −1 if yt−i < yt−j. We have31:

$$ \operatorname{var}\left(S_{t}^{(n)}\right)=\frac{n\left(n-1\right)\left(2n+5\right)}{18} $$

We can show that:

$$ -\frac{n\left(n+1\right)}{2}\leq S_{t}^{(n)}\leq\frac{n\left(n+1\right)}{2} $$

The bounds are reached if yt < yt−i (negative trend) or yt > yt−i (positive trend) for i ∈ N∗. We can then normalise the score:

$$ \bar{S}_{t}^{(n)}=\frac{2S_{t}^{(n)}}{n\left(n+1\right)} $$

The normalised score $\bar{S}_{t}^{(n)}$ takes the value +1 (or −1) if we have a perfect positive (or negative) trend. If there is no trend, it is obvious that $\bar{S}_{t}^{(n)}\simeq 0$. Under this null hypothesis, we have:

$$ Z_{t}^{(n)}\xrightarrow[n\rightarrow\infty]{}\mathcal{N}\left(0,1\right) $$

with:

$$ Z_{t}^{(n)}=\frac{S_{t}^{(n)}}{\sqrt{\operatorname{var}\left(S_{t}^{(n)}\right)}} $$
In Figure 19, we report the normalised score $\bar{S}_{t}^{(n)}$ for the S&P 500 index and different values of n. Statistics relating to the null hypothesis are given in Table 2 for the study period. We notice that we generally reject the hypothesis that there is no trend when we consider a period of one year. The number of cases in which we observe a trend increases if we consider a shorter period. For example, if n is equal to 10 days, we accept the hypothesis that there is no trend in 42% of cases when the confidence level α is set to 90%.
Remark 7 We have plotted the statistic $\bar{S}_{t}^{(10)}$ against the trend estimate32 μ̂t for the S&P 500 index since January 2000. We notice that μ̂t may be positive whereas $\bar{S}_{t}^{(10)}$ is negative. This illustrates that a trend measurement is just an estimate; it does not mean that a trend exists.
31 If there are some tied sequences (yt−i = yt−i−1), the formula becomes:

$$ \operatorname{var}\left(S_{t}^{(n)}\right)=\frac{1}{18}\left(n\left(n-1\right)\left(2n+5\right)-\sum_{k=1}^{g}n_{k}\left(n_{k}-1\right)\left(2n_{k}+5\right)\right) $$

with g the number of tied sequences and nk the number of data points in the kth tied sequence.
32 It is computed with a uniform moving average of 10 days.
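The score and its normalisation can be sketched directly in numpy. One assumption here is the pairwise window convention: we read $S_t^{(n)}$ as a sum of sgn terms over the last n + 1 observations, which is consistent with the bounds ±n(n+1)/2 above; the function name is illustrative.

```python
import numpy as np

def mann_kendall_score(y, n):
    """Mann-Kendall trend score on the last n+1 observations (sketch).

    Returns the normalised score in [-1, 1] and the Z statistic, which is
    asymptotically N(0, 1) under the no-trend null hypothesis."""
    w = np.asarray(y, dtype=float)[-(n + 1):]
    S = 0.0
    # S_t = sum of sgn(y_{t-i} - y_{t-j}) over pairs 0 <= i < j <= n
    for i in range(n):
        for j in range(i + 1, n + 1):
            S += np.sign(w[-(i + 1)] - w[-(j + 1)])
    S_bar = 2.0 * S / (n * (n + 1))            # normalised score
    var_S = n * (n - 1) * (2 * n + 5) / 18.0   # variance under no trend
    Z = S / np.sqrt(var_S)
    return S_bar, Z
```

A perfectly increasing (decreasing) series reaches the upper (lower) bound, so the normalised score equals +1 (−1).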
3.3 From trend filtering to trend forecasting
There are two possible applications for the trend following problem. First, trend filtering
can analyse the past. A noisy signal can be transformed into a smoother signal, which can be
interpreted more easily. An ex-post analysis of this kind can, for instance, clearly separate
increasing price periods from decreasing price periods. This analysis can be performed on
any time series, or even on a random walk. For example, we have reported four simulations
of a geometric Brownian motion without drift and annual volatility of 20% in Figure 21. In
this context, trend filtering could help us to estimate the different trends in the past.
On the other hand, trend analysis may be used as a predictive tool. Prediction is a
much more ambitious objective than analysing the past. It cannot be performed on any
time series. For instance, trend following predictions suppose that the last observed trend
influences future returns. More precisely, these predictors suppose that positive (or negative)
trends are more likely to be followed by positive (or negative) returns. Such an assumption
has to be tested empirically. For example, it is obvious that the time series in Figure 21
exhibit certain trends, whereas we know that there is no trend in a geometric Brownian
motion without drift. Thus, we may still observe some trends in an ex-post analysis. It does
not mean, however, that trends will persist in the future.
The persistence of trends is tested here in a simple framework for major financial indices33. For each of these indices, the one-month returns are separated into two sets. The first set includes one-month returns that immediately follow a positive three-month return (a negative three-month return for the second set). The average one-month return is computed for each of these two sets, and the results are given in Table 3. These results clearly show
33 The study period begins in January 1995 (January 1999 for the MSCI EM) and finishes in October 2011.
that, on average, higher returns can be expected after a positive three-month return than
after a negative three-month period. Therefore, observation of the current trend may have a
predictive value for the indices under consideration. Moreover, we consider the distribution
of the one-month returns, based on past three-month returns. Figure 22 illustrates the case
of the GSCI index. In the first quadrant, the one-month returns are divided into two sets,
depending on whether the previous three-month return is positive or negative. The cumu-
lative distributions of these two sets are shown. In the second quadrant, we consider, on
the one hand, the distribution of one-month returns following a three-month return below
−5% and, on the other hand, the distribution of returns following a three-month return
exceeding +5%. The same procedure is repeated in the other quadrants, for a 10% and a
15% threshold. This simple test illustrates the usefulness of trend following strategies. Here,
trends seem persistent enough to study such strategies. Of course, on other time scales or
for other assets, one may obtain opposite results that would support contrarian strategies.
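The persistence test described above can be sketched as follows. The 63- and 21-trading-day horizons standing in for three months and one month, and the function name, are our illustrative assumptions.

```python
import numpy as np

def trend_persistence(prices, h_past=63, h_next=21):
    """Split h_next-day forward returns into two sets according to the sign
    of the trailing h_past-day return, and compare their averages (sketch)."""
    p = np.asarray(prices, dtype=float)
    # trailing h_past-day return observed at each date t
    past = p[h_past:-h_next] / p[:-h_past - h_next] - 1.0
    # subsequent h_next-day return starting at the same date t
    fwd = p[h_past + h_next:] / p[h_past:-h_next] - 1.0
    up = fwd[past > 0].mean() if np.any(past > 0) else np.nan
    down = fwd[past < 0].mean() if np.any(past < 0) else np.nan
    return up, down
```

If trends persist, the average return following a positive trailing return should exceed the one following a negative trailing return.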
4 Conclusion
The ultimate goal of trend filtering in finance is to design portfolio strategies that may
benefit from these trends. But the path between trend measurement and portfolio allocation
is not straightforward. It involves studies and explanations that would not fit in this paper.
Nevertheless, let us point out some major issues. Of course, the first problem is the selection
of the trend filtering method. This selection may lead to a single procedure or to a pool of
methods. The selection of several methods raises the question of an aggregation procedure.
This can be done through averaging or dynamic model selection, for instance. The resulting
trend indicator is meant to forecast future asset returns at a given horizon.
Intuitively, an investor should buy assets with positive return forecasts and sell assets
with negative forecasts. But the size of each long or short position is a quantitative problem
that requires a clear investment process. This process should take into account the risk
entailed by each position, compared with the expected return. Traditionally, individual
risks can be calculated in relation to asset volatility. A correlation matrix can aggregate
those individual risks into a global portfolio risk. But in the case of a multi-asset trend
following strategy, should we consider the correlation of assets or the correlation of each
individual strategy? These may be quite different, as the correlations between strategies
are usually smaller than the correlations between assets in absolute terms. Even when the
portfolio risks can be calculated, the distribution of those risks between assets or strategies
remains an open problem. Clearly, this distribution should take into account the individual
risks, their correlations and the expected return of each asset. But there are many competing
allocation procedures, such as Markowitz portfolio theory or risk budgeting methods.
In addition, the total amount of risk in the portfolio must be decided. The average target
volatility of the portfolio is closely related to the risk aversion of the final investor. But this
total amount of risk may not be constant over time, as some periods could bring higher
expected returns than others. For example, some funds do not change the average size of
their positions during periods of high market volatility. This increases their risks, but they
consider that their return opportunities, even when risk-adjusted, are greater during those
periods. On the contrary, some investors reduce their exposure to markets during volatility
peaks, in order to limit their potential drawdowns. In any case, a consistent investment
process should measure and control the global risk of the portfolio.
These are just a few questions relating to trend following strategies. Many more arise in
practical cases, such as execution policies and transaction cost management. Each of these
issues must be studied in depth, and re-examined on a regular basis. This is the essence of
quantitative management processes.
A Statistical complements
A.1 State space model and Kalman filtering
A state space model is defined by a transition equation and a measurement equation. In
the measurement equation, we postulate the relationship between an observable vector and
a state vector, while the transition equation describes the generating process of the state
variables. The state vector αt is generated by a first-order Markov process of the form:
$$ \alpha_{t}=T_{t}\alpha_{t-1}+c_{t}+R_{t}\eta_{t} $$

where αt is the vector of the m state variables, Tt is an m × m matrix, ct is an m × 1 vector and Rt is an m × p matrix. The measurement equation of the state space representation is:

$$ y_{t}=Z_{t}\alpha_{t}+d_{t}+\varepsilon_{t} $$

where yt is an n-dimensional time series, Zt is an n × m matrix and dt is an n × 1 vector. ηt and εt are assumed to be white noise processes of dimensions p and n respectively. These two uncorrelated processes are Gaussian with zero mean and respective covariance matrices Qt and Ht. α0 ∼ N(a0, P0) describes the initial position of the state vector. We define at and $a_{t|t-1}$ as the optimal estimators of αt based on all the information available at times t and t − 1 respectively. Let Pt and $P_{t|t-1}$ be the associated covariance matrices34. The Kalman filter consists of the following set of recursive equations (Harvey, 1990):
$$
\begin{aligned}
a_{t|t-1} &= T_{t}a_{t-1}+c_{t} \\
P_{t|t-1} &= T_{t}P_{t-1}T_{t}^{\top}+R_{t}Q_{t}R_{t}^{\top} \\
y_{t|t-1} &= Z_{t}a_{t|t-1}+d_{t} \\
v_{t} &= y_{t}-y_{t|t-1} \\
F_{t} &= Z_{t}P_{t|t-1}Z_{t}^{\top}+H_{t} \\
a_{t} &= a_{t|t-1}+P_{t|t-1}Z_{t}^{\top}F_{t}^{-1}v_{t} \\
P_{t} &= \left(I_{m}-P_{t|t-1}Z_{t}^{\top}F_{t}^{-1}Z_{t}\right)P_{t|t-1}
\end{aligned}
$$
where vt is the innovation process with covariance matrix Ft and $y_{t|t-1}=\operatorname{E}_{t-1}\left[y_{t}\right]$. Harvey (1989) shows that we can obtain $a_{t+1|t}$ directly from $a_{t|t-1}$:

$$ a_{t+1|t}=\left(T_{t+1}-K_{t}Z_{t}\right)a_{t|t-1}+K_{t}y_{t}+\left(c_{t+1}-K_{t}d_{t}\right) $$

where $K_{t}=T_{t+1}P_{t|t-1}Z_{t}^{\top}F_{t}^{-1}$ is the gain matrix. We also have:

$$ a_{t+1|t}=T_{t+1}a_{t|t-1}+c_{t+1}+K_{t}\left(y_{t}-Z_{t}a_{t|t-1}-d_{t}\right) $$

Finally, we obtain:

$$
\begin{aligned}
y_{t} &= Z_{t}a_{t|t-1}+d_{t}+v_{t} \\
a_{t+1|t} &= T_{t+1}a_{t|t-1}+c_{t+1}+K_{t}v_{t}
\end{aligned}
$$

This system is called the innovation representation.
Let t⋆ be a fixed given date. We define $a_{t|t^{\star}}=\operatorname{E}_{t^{\star}}\left[\alpha_{t}\right]$ and $P_{t|t^{\star}}=\operatorname{E}_{t^{\star}}\left[\left(a_{t|t^{\star}}-\alpha_{t}\right)\left(a_{t|t^{\star}}-\alpha_{t}\right)^{\top}\right]$ with t ≤ t⋆. We have $a_{t^{\star}|t^{\star}}=a_{t^{\star}}$ and $P_{t^{\star}|t^{\star}}=P_{t^{\star}}$. The Kalman smoother is then defined by the following set of recursive equations:

$$
\begin{aligned}
P_{t}^{\star} &= P_{t}T_{t+1}^{\top}P_{t+1|t}^{-1} \\
a_{t|t^{\star}} &= a_{t}+P_{t}^{\star}\left(a_{t+1|t^{\star}}-a_{t+1|t}\right) \\
P_{t|t^{\star}} &= P_{t}+P_{t}^{\star}\left(P_{t+1|t^{\star}}-P_{t+1|t}\right)P_{t}^{\star\top}
\end{aligned}
$$

34 We have $a_{t}=\operatorname{E}_{t}\left[\alpha_{t}\right]$, $a_{t|t-1}=\operatorname{E}_{t-1}\left[\alpha_{t}\right]$, $P_{t}=\operatorname{E}_{t}\left[\left(a_{t}-\alpha_{t}\right)\left(a_{t}-\alpha_{t}\right)^{\top}\right]$ and $P_{t|t-1}=\operatorname{E}_{t-1}\left[\left(a_{t|t-1}-\alpha_{t}\right)\left(a_{t|t-1}-\alpha_{t}\right)^{\top}\right]$, where Et indicates the conditional expectation operator.
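The filter recursions can be sketched in numpy. For simplicity this sketch assumes time-invariant system matrices and Rt = I; the function name is illustrative.

```python
import numpy as np

def kalman_filter(y, T, Z, Q, H, a0, P0, c=None, d=None):
    """Kalman filter recursions for the state space model above (sketch:
    time-invariant matrices T, Z, Q, H and R_t = I)."""
    m = a0.shape[0]
    c = np.zeros(m) if c is None else c
    d = np.zeros(y.shape[1]) if d is None else d
    a, P = a0, P0
    filtered = []
    for t in range(y.shape[0]):
        # prediction step
        a_pred = T @ a + c
        P_pred = T @ P @ T.T + Q
        # innovation and its covariance
        v = y[t] - (Z @ a_pred + d)
        F = Z @ P_pred @ Z.T + H
        # update step
        K = P_pred @ Z.T @ np.linalg.inv(F)
        a = a_pred + K @ v
        P = (np.eye(m) - K @ Z) @ P_pred
        filtered.append(a)
    return np.array(filtered)
```

On a constant local-level model, the filtered state converges towards the observed level, as expected from the recursions.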
A.2 L1 filtering
A.2.1 The dual problem
The L1 filtering problem can be solved by considering the dual problem, which is a QP programme. We first rewrite the primal problem with a new variable z = Dx̂:

$$
\begin{aligned}
\min\ & \frac{1}{2}\left\|y-\hat{x}\right\|_{2}^{2}+\lambda\left\|z\right\|_{1} \\
\text{u.c.}\ & z=D\hat{x}
\end{aligned}
$$

We now construct the Lagrangian function with the dual variable $\nu\in\mathbb{R}^{n-2}$:

$$ \mathcal{L}\left(\hat{x},z,\nu\right)=\frac{1}{2}\left\|y-\hat{x}\right\|_{2}^{2}+\lambda\left\|z\right\|_{1}+\nu^{\top}\left(D\hat{x}-z\right) $$

The dual objective function is obtained in the following way:

$$ \inf_{\hat{x},z}\mathcal{L}\left(\hat{x},z,\nu\right)=-\frac{1}{2}\nu^{\top}DD^{\top}\nu+y^{\top}D^{\top}\nu $$

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$
\begin{aligned}
\min\ & \frac{1}{2}\nu^{\top}DD^{\top}\nu-y^{\top}D^{\top}\nu \\
\text{u.c.}\ & -\lambda\mathbf{1}\leq\nu\leq\lambda\mathbf{1}
\end{aligned}
$$

The solution is then $\hat{x}=y-D^{\top}\nu$.
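Because the dual has only box constraints, it can be sketched with an off-the-shelf bound-constrained solver. This is an illustrative sketch, not the interior-point method described below: we minimise the dual objective with scipy's L-BFGS-B and recover the trend as x̂ = y − D⊤ν.

```python
import numpy as np
from scipy.optimize import minimize

def l1_trend_filter(y, lam):
    """L1 trend filtering via its box-constrained dual QP (sketch).

    Solves min_nu 0.5 nu' D D' nu - y' D' nu  s.t. -lam <= nu <= lam,
    then recovers the trend as x = y - D' nu."""
    n = len(y)
    # D: (n-2) x n second-difference operator
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    DDt = D @ D.T
    Dy = D @ y

    def obj(nu):
        return 0.5 * nu @ DDt @ nu - Dy @ nu

    def grad(nu):
        return DDt @ nu - Dy

    res = minimize(obj, np.zeros(n - 2), jac=grad, method="L-BFGS-B",
                   bounds=[(-lam, lam)] * (n - 2))
    return y - D.T @ res.x
```

A linear series has Dy = 0, so the dual optimum is ν = 0 and the filter returns the series unchanged.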
This dual problem can be solved using an interior-point algorithm. We consider the following general optimisation problem:

$$
\begin{aligned}
\min\ & f_{0}\left(\theta\right) \\
\text{u.c.}\ & A\theta=b \\
& f_{i}\left(\theta\right)\leq 0\quad\text{for }i=1,\ldots,m
\end{aligned}
$$

where $f_{0},\ldots,f_{m}:\mathbb{R}^{n}\rightarrow\mathbb{R}$ are convex and twice continuously differentiable and rank(A) = p < n. The inequality constraints become implicit if the problem is rewritten as:

$$
\begin{aligned}
\min\ & f_{0}\left(\theta\right)+\sum_{i=1}^{m}I_{-}\left(f_{i}\left(\theta\right)\right) \\
\text{u.c.}\ & A\theta=b
\end{aligned}
$$

where I−(u) = 0 if u ≤ 0 and I−(u) = ∞ otherwise.
42
Issue # 8 T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S
The indicator function I− can be approximated by the logarithmic barrier function −τ−1 ln(−u), which converges to I− as τ → ∞. Finally, the Kuhn-Tucker conditions for this approximated problem give rτ(θ, λ, ν) = 0 with:

$$
r_{\tau}\left(\theta,\lambda,\nu\right)=\begin{pmatrix}\nabla f_{0}\left(\theta\right)+\nabla f\left(\theta\right)^{\top}\lambda+A^{\top}\nu \\ -\operatorname{diag}\left(\lambda\right)f\left(\theta\right)-\tau^{-1}\mathbf{1} \\ A\theta-b\end{pmatrix}
$$

The solution of rτ(θ, λ, ν) = 0 can be obtained using Newton's iteration for the triple π = (θ, λ, ν):

$$ r_{\tau}\left(\pi+\Delta\pi\right)\simeq r_{\tau}\left(\pi\right)+\nabla r_{\tau}\left(\pi\right)\Delta\pi=0 $$

This equation gives the Newton step $\Delta\pi=-\nabla r_{\tau}\left(\pi\right)^{-1}r_{\tau}\left(\pi\right)$, which defines the search direction.
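The Newton iteration at the heart of this scheme can be sketched generically: given a residual function r and its Jacobian, each step solves a linear system for the search direction. The function name is illustrative, and a practical interior-point implementation would add a line search and the τ-update.

```python
import numpy as np

def newton_system(r, jac, x0, tol=1e-10, max_iter=50):
    """Newton's iteration for solving r(x) = 0 (sketch of the search
    direction dx = -jac(x)^{-1} r(x) used above)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(x), r(x))  # solve jac(x) step = r(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```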
For a multivariate extension with m observed series y(1), ..., y(m), the primal problem becomes:

$$
\begin{aligned}
\min\ & \frac{1}{2}\sum_{j=1}^{m}\left\|y^{(j)}-\hat{x}\right\|_{2}^{2}+\lambda\left\|z\right\|_{1} \\
\text{u.c.}\ & z=D\hat{x}
\end{aligned}
$$

The dual objective function becomes:

$$ \inf_{\hat{x},z}\mathcal{L}\left(\hat{x},z,\nu\right)=-\frac{1}{2}\nu^{\top}DD^{\top}\nu+\bar{y}^{\top}D^{\top}\nu+\frac{1}{2}\sum_{j=1}^{m}\left(y^{(j)}-\bar{y}\right)^{\top}\left(y^{(j)}-\bar{y}\right) $$

for −λ1 ≤ ν ≤ λ1, where ȳ denotes the average of the m series. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$
\begin{aligned}
\min\ & \frac{1}{2}\nu^{\top}DD^{\top}\nu-\bar{y}^{\top}D^{\top}\nu \\
\text{u.c.}\ & -\lambda\mathbf{1}\leq\nu\leq\lambda\mathbf{1}
\end{aligned}
$$

The solution is then $\hat{x}=\bar{y}-D^{\top}\nu$.
The process I1 (T ) is a Wiener integral (or a Gaussian process) with variance:
T
# $ 2 T3
E I12 (T ) = (T − t) dt =
0 3
In this case, we expect that λmax ∼ T 3/2 . The second order primitive can be calculated in
the following way:
T
I2 (T ) = I1 (t) dt
0
T
= I1 (T ) T − t dI1 (T )
0
T
= I1 (T ) T − tWt dt
0
2
T 2
T t
= I1 (T ) T − WT + dWt
2 0 2
T
T2 t2
= − WT + T2 − Tt + dWt
2 0 2
1 T 2
= (T − t) dWT
2 0
This quantity is again a Gaussian process with variance:
1 T 4 T5
E[I22 (T )] = (T − t) dt =
4 0 20
The first wavelet approach appeared in the early eighties in seismic data analysis. The
term wavelet was introduced in the scientific community by Grossmann and Morlet (1984).
Since 1986, a great deal of theoretical research on wavelets has been developed.
The wavelet transform uses a basic function, called the mother wavelet, then dilates and
translates it to capture features that are local in time and frequency. The distribution of the
time-frequency domain with respect to the wavelet transform is long in time when capturing
low frequency events and long in frequency when capturing high frequency events. As an
example, we represent some mother wavelets in Figure 24.
The aim of wavelet analysis is to separate signal trends and details. These different
components can be distinguished by different levels of resolution or different sizes/scales
of detail. In this sense, it generates a phase space decomposition which is defined by two
parameters (scale and location) in opposition to a Fourier decomposition. A wavelet ψ (t)
is a function of time t such that:
$$ \int_{-\infty}^{+\infty}\psi\left(t\right)\mathrm{d}t=0 $$

$$ \int_{-\infty}^{+\infty}\left|\psi\left(t\right)\right|^{2}\mathrm{d}t=1 $$
The continuous wavelet transform is a function of two variables W(u, s) and is given by projecting the time series x(t) onto a particular wavelet ψ:

$$ W\left(u,s\right)=\int_{-\infty}^{+\infty}x\left(t\right)\psi_{u,s}\left(t\right)\mathrm{d}t $$

with:

$$ \psi_{u,s}\left(t\right)=\frac{1}{\sqrt{s}}\,\psi\left(\frac{t-u}{s}\right) $$

which corresponds to the mother wavelet translated by u (location parameter) and dilated by s (scale parameter). If the wavelet satisfies the previous properties, the inverse operation may be performed to produce the original signal from its wavelet coefficients:

$$ x\left(t\right)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}W\left(u,s\right)\psi_{u,s}\left(t\right)\mathrm{d}u\,\mathrm{d}s $$
The continuous wavelet transform of a time series signal x (t) gives an infinite number
of coefficients W (u, s) where u ∈ R and s ∈ R+ , but many coefficients are close or equal to
zero. The discrete wavelet transform can be used to decompose a signal into a finite number
of coefficients where we use s = 2−j as the scale parameter and u = k2−j as the location
parameter with j ∈ Z and k ∈ Z. Therefore ψu,s(t) becomes:

$$ \psi_{j,k}\left(t\right)=2^{j/2}\,\psi\left(2^{j}t-k\right) $$

where j = 1, 2, ..., J in a J-level decomposition. The wavelet representation of a discrete signal x(t) is given by:

$$ x\left(t\right)=s^{(0)}\phi\left(t\right)+\sum_{j=0}^{J-1}\sum_{k=0}^{2^{j}-1}d^{(j),k}\,\psi_{j,k}\left(t\right) $$
Introduced by Mallat (1989), the multi-scale analysis corresponds to the following iterative scheme:

    x
    s       d
    ss      sd
    sss     ssd
    ssss    sssd
where the high-pass filter defines the details of the data and the low-pass filter defines the smoothing signal. In this example, we obtain these wavelet coefficients:

$$ W=\begin{pmatrix}ssss \\ sssd \\ ssd \\ sd \\ d\end{pmatrix} $$

Applying this pyramidal algorithm to the time series signal up to the J resolution level gives us the wavelet coefficients:

$$ W=\begin{pmatrix}s^{(0)} \\ d^{(0)} \\ d^{(1)} \\ \vdots \\ d^{(J-1)}\end{pmatrix} $$
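The pyramidal scheme can be sketched with the Haar wavelet, where the low-pass filter is a pairwise average and the high-pass filter a pairwise difference. This sketch assumes the input length is a power of two; the function name is illustrative.

```python
import numpy as np

def haar_pyramid(x, levels):
    """Multi-scale (pyramidal) decomposition with the Haar wavelet (sketch).

    At each level the low-pass filter produces the smooth part s (pairwise
    averages) and the high-pass filter the details d (pairwise differences).
    Assumes len(x) is a power of two."""
    s = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        s_next = (s[0::2] + s[1::2]) / np.sqrt(2.0)  # low-pass: smooth
        d = (s[0::2] - s[1::2]) / np.sqrt(2.0)       # high-pass: details
        details.append(d)
        s = s_next
    return s, details  # coefficients W = (s, d(J-1), ..., d(0))
```

Since the transform is orthonormal, the energy of the signal is preserved by the coefficients, and a constant signal produces zero detail coefficients at every level.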
In the linear case, the decision boundary is the hyperplane:

$$ \mathcal{H}=\left\{x:h\left(x\right)=w^{\top}x+b=0\right\} $$

The vector w is interpreted as the normal vector to the hyperplane. We denote its norm by ‖w‖ and its direction by ŵ = w/‖w‖. In Figure 25, we give a geometric interpretation of the margin in the linear case. Let x+ and x− be the closest points to the hyperplane from the positive side and the negative side. These points determine the margin to the boundary from which the two classes of points D are separated:

$$ m_{\mathcal{D}}\left(h\right)=\frac{1}{2}\,\hat{w}^{\top}\left(x_{+}-x_{-}\right)=\frac{1}{\left\|w\right\|} $$
Q U A N T R E S E A R C H B Y LY X O R
47
Figure 25: Geometric interpretation of the margin in a linear SVM
The main idea of a maximum margin classifier is to determine the hyperplane that maximises the margin. For a separable dataset, the margin SVM is defined by the following optimisation problem:

$$
\begin{aligned}
\min_{w,b}\ & \frac{1}{2}\left\|w\right\|^{2} \\
\text{u.c.}\ & y_{i}\left(w^{\top}x_{i}+b\right)\geq 1\quad\text{for }i=1,\ldots,n
\end{aligned}
$$

The historical approach to solving this quadratic problem with nonlinear constraints is to map the primal problem to the dual problem:

$$
\begin{aligned}
\max_{\alpha}\ & \sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}\,x_{i}^{\top}x_{j} \\
\text{u.c.}\ & \alpha_{i}\geq 0\quad\text{for }i=1,\ldots,n
\end{aligned}
$$

We notice that the linear SVM depends on the input data via the inner product. An intelligent way to extend the SVM formalism to the nonlinear case is then to replace the inner product with a nonlinear kernel. Hence, the nonlinear SVM dual problem can be obtained by systematically replacing the inner product $x_{i}^{\top}x_{j}$ by a general kernel K(xi, xj). Some standard kernels are widely used in pattern recognition, for example polynomial, radial basis or neural network kernels36.
We have seen that the linear SVM is a special case of the nonlinear SVM within the kernel approach. We therefore consider the nonlinear case directly, where the approximate function of the regression has the form f(x) = w⊤φ(x) + b. In the VRM framework38, we assume that P(x, y) is a Gaussian noise with variance σ2:

$$ R\left(f\right)=\frac{1}{n}\sum_{i=1}^{n}\left|f\left(x_{i}\right)-y_{i}\right|^{p}+\sigma^{2}\left\|w\right\|^{2} $$

In the present form, the regression looks very similar to the SVM classification problem and can be solved in the same way by mapping to the dual problem. We notice that the SVM regression can be easily generalised in two possible ways:
1. by introducing a more general loss function such as the ε-SV regression proposed by Vapnik (1998);
2. by using a weighting distribution ω for the empirical distribution:

$$ \mathrm{d}P\left(x,y\right)=\sum_{i=1}^{n}\omega_{i}\,\delta_{x_{i}}\left(x\right)\delta_{y_{i}}\left(y\right) $$
36 We have, respectively, $K\left(x_{i},x_{j}\right)=\left(x_{i}^{\top}x_{j}+1\right)^{p}$, $K\left(x_{i},x_{j}\right)=\exp\left(-\left\|x_{i}-x_{j}\right\|^{2}/\left(2\sigma^{2}\right)\right)$ or $K\left(x_{i},x_{j}\right)=\tanh\left(a\,x_{i}^{\top}x_{j}-b\right)$.
37 This framework, called ERM, was first introduced by Vapnik and Chervonenkis (1991).
38 This framework is called VRM (Chapelle, 2002).
As financial series have short memory and depend more on the recent past, an asymmetric weight distribution focusing on recent data would improve the prediction.
The parameter b can be computed from the condition:

$$ w^{\top}\phi\left(x_{i}\right)+b-y_{i}=0 $$

for support vectors (xi, yi). In order to achieve a good level of accuracy for the estimation of b, we average over the set of support vectors and obtain b⋆. The SVM regressor is then given by the following formula:

$$ f\left(x\right)=\sum_{i=1}^{n}\alpha_{i}K\left(x,x_{i}\right)+b^{\star} $$

In Figure 26, we apply SVM regression with the Gaussian kernel to the S&P 500 index. The kernel parameter σ characterises the estimation horizon, which is equivalent to the period n in the moving average regression.
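A kernel regressor of the form f(x) = Σ αi K(x, xi) + b can be sketched in closed form when the loss is quadratic (p = 2). This least-squares variant stands in for the ε-SV regression above (it is not the ε-insensitive machine itself); the function names and the way b is set to the target mean are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_regression_fit(X, y, sigma, reg):
    """Closed-form kernel regression with quadratic loss (sketch): solve
    (K + reg * I) alpha = y - b for the dual coefficients alpha."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    b = y.mean()  # simple illustrative choice for the offset b
    alpha = np.linalg.solve(K + reg * np.eye(n), y - b)
    return alpha, b

def kernel_regression_predict(X_train, alpha, b, sigma, X_new):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha + b
```

With a small regularisation, the regressor nearly interpolates a smooth target on its training points; σ plays the role of the estimation horizon.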
The method is based on the principal component analysis of the auto-covariance matrix of the time series y = (y1, ..., yt). Let n be the window length such that n = t − m + 1 with m < t/2. We define the n × m Hankel matrix H as the matrix of the m concatenated lag vectors of y:

$$
H=\begin{pmatrix}
y_{1} & y_{2} & y_{3} & \cdots & y_{m} \\
y_{2} & y_{3} & y_{4} & \cdots & y_{m+1} \\
y_{3} & y_{4} & y_{5} & \cdots & \vdots \\
\vdots & \vdots & \vdots & \ddots & y_{t-1} \\
y_{n} & y_{n+1} & y_{n+2} & \cdots & y_{t}
\end{pmatrix}
$$

We recover the time series y by diagonal averaging:

$$ y_{p}=\frac{1}{\alpha_{p}}\sum_{i+j-1=p}H^{(i,j)} \qquad (10) $$

where αp is the number of pairs (i, j) such that i + j − 1 = p.
This relationship seems trivial because each H(i,j) is equal to yp under the condition i + j − 1 = p. But this equality no longer holds if we apply factor analysis. Let C = H⊤H be the covariance matrix of H. By performing the eigenvalue decomposition C = V ΛV⊤, we can deduce the corresponding principal components:

$$ P_{k}=HV_{k} \qquad \hat{H}=\sum_{k}P_{k}V_{k}^{\top} $$

where the sum runs over the selected components. We have Ĥ = H if all the components are selected. If k < m, we have removed the noise and the trend x̂ is estimated by applying the diagonal averaging procedure (10) to the matrix Ĥ.
We have applied the singular spectrum decomposition to the S&P 500 index with different lags m. For each lag, we compute the Hankel matrix H, then deduce the matrix Ĥ using only the first eigenvector (k = 1) and estimate the corresponding trend. Results are given in Figure 27. As for other methods, such as nonlinear filters, the calibration depends on the parameter m, which controls the window length.
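The whole procedure — Hankel embedding, truncated eigendecomposition of C = H⊤H, and diagonal averaging — can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def ssa_trend(y, m, k=1):
    """Trend extraction by singular spectrum analysis (sketch): build the
    Hankel matrix, keep the first k principal components of C = H'H, and
    reconstruct by diagonal averaging."""
    t = len(y)
    n = t - m + 1
    H = np.column_stack([y[j:j + n] for j in range(m)])  # n x m Hankel matrix
    C = H.T @ H
    eigval, V = np.linalg.eigh(C)
    V = V[:, np.argsort(eigval)[::-1][:k]]   # k leading eigenvectors
    H_hat = H @ V @ V.T                      # filtered Hankel matrix
    # diagonal averaging: average H_hat over anti-diagonals i + j - 1 = p
    x = np.zeros(t)
    count = np.zeros(t)
    for j in range(m):
        x[j:j + n] += H_hat[:, j]
        count[j:j + n] += 1.0
    return x / count
```

If all m components are selected, Ĥ = H and the diagonal averaging returns the original series exactly, which is a convenient check of the implementation.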
Figure 27: SSA filtering
References
[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008),
A Review of Some Modern Approaches to the Problem of Trend Extraction , US Census
Bureau, RRS #2008/03.
[2] Antoniadis A., Gregoire G. and McKeague I.W. (1994), Wavelet Methods for
Curve Estimation, Journal of the American Statistical Association, 89(428), pp. 1340-
1353.
[3] Barberis N. and Thaler R. (2002), A Survey of Behavioral Finance, NBER Working Paper, 9222.
[4] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of
Economic Time Series into Permanent and Transitory Components with Particular
Attention to Measurement of the Business Cycle, Journal of Monetary Economics,
7(2), pp. 151-174.
[5] Boser B.E., Guyon I.M. and Vapnik V. (1992), A Training Algorithm for Optimal
Margin Classifier, Proceedings of the Fifth Annual Workshop on Computational Learn-
ing Theory, pp. 114-152.
[6] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University
Press.
[7] Brockwell P.J. and Davis R.A. (2003), Introduction to Time Series and Forecasting,
Springer.
[8] Broomhead D.S. and King G.P. (1986), On the Qualitative Analysis of Experimental
Dynamical Systems, in Sarkar S. (ed.), Nonlinear Phenomena and Chaos, Adam Hilger,
pp. 113-144.
[9] Brown S.J., Goetzmann W.N. and Kumar A. (1998), The Dow Theory: William
Peter Hamilton’s Track Record Reconsidered, Journal of Finance, 53(4), pp. 1311-1333.
[10] Burch N., Fishback P.E. and Gordon R. (2005), The Least-Squares Property of the
Lanczos Derivative, Mathematics Magazine, 78(5), pp. 368-378.
[11] Carhart M.M. (1997), On Persistence in Mutual Fund Performance, Journal of Fi-
nance, 52(1), pp. 57-82.
[12] Chan L.K.C., Jegadeesh N. and Lakonishok J. (1996), Momentum Strategies, Jour-
nal of Finance, 51(5), pp. 1681-1713.
[13] Chang Y., Miller J.I. and Park J.Y. (2009), Extracting a Common Stochastic Trend:
Theory with Some Applications, Journal of Econometrics, 150(2), pp. 231-247.
[14] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning
and Prior Knowledge, PhD thesis, University of Paris 6.
[15] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A
Model for the Census X-11 Program, Journal of the American Statistical Association,
71(355), pp. 581-587.
[16] Cleveland W.S. (1979), Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), pp. 829-836.
[17] Cleveland W.S. and Devlin S.J. (1988), Locally Weighted Regression: An Approach
to Regression Analysis by Local Fitting, Journal of the American Statistical Associa-
tion, 83(403), pp. 596-610.
[19] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20(3),
pp. 273-297.
[22] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Al-
gorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on
Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[24] Donoho D.L. and Johnstone I.M. (1994), Ideal Spatial Adaptation via Wavelet
Shrinkage, Biometrika, 81(3), pp. 425-455.
[25] Donoho D.L. and Johnstone I.M. (1995), Adapting to Unknown Smoothness via
Wavelet Shrinkage, Journal of the American Statistical Association, 90(432), pp. 1200-
1224.
[26] Doucet A., De Freitas N. and Gordon N. (2001), Sequential Monte Carlo in Prac-
tice, Springer.
[27] Ehlers J.F. (2001), Rocket Science for Traders: Digital Signal Processing Applications,
John Wiley & Sons.
[28] Elton E.J. and Gruber M.J. (1972), Earnings Estimates and the Accuracy of Expec-
tational Data, Management Science, 18(8), pp. 409-424.
[29] Engle R.F. and Granger C.W.J. (1987), Co-Integration and Error Correction: Rep-
resentation, Estimation, and Testing, Econometrica, 55(2), pp. 251-276.
[30] Fama E. (1970), Efficient Capital Markets: A Review of Theory and Empirical Work,
Journal of Finance, 25(2), pp. 383-417.
[31] Flandrin P., Rilling G. and Goncalves P. (2004), Empirical Mode Decomposition
as a Filter Bank, Signal Processing Letters, 11(2), pp. 112-114.
[32] Fliess M. and Join C. (2009), A Mathematical Proof of the Existence of Trends in
Financial Time Series, in El Jai A., Afifi L. and Zerrik E. (eds), Systems Theory:
Modeling, Analysis and Control, Presses Universitaires de Perpignan, pp. 43-62.
[33] Fuentes M. (2002), Spectral Methods for Nonstationary Spatial Processes, Biometrika,
89(1), pp. 197-210.
[34] Gençay R., Selçuk F. and Whitcher B. (2002), An Introduction to Wavelets and
Other Filtering Methods in Finance and Economics, Academic Press.
[35] Gestel T.V., Suykens J.A.K., Baestaens D., Lambrechts A., Lanckriet G.,
Vandaele B., De Moor B. and Vandewalle J. (2001), Financial Time Series Pre-
diction Using Least Squares Support Vector Machines Within the Evidence Framework,
IEEE Transactions on Neural Networks, 12(4), pp. 809-821.
[36] Golyandina N., Nekrutkin V.V. and Zhigljavsky A.A. (2001), Analysis of Time
Series Structure: SSA and Related Techniques, Chapman & Hall, CRC.
[37] Gonzalo J. and Granger C.W.J. (1995), Estimation of Common Long-Memory Com-
ponents in Cointegrated Systems, Journal of Business & Economic Statistics, 13(1), pp.
27-35.
[38] Grinblatt M., Titman S. and Wermers R. (1995), Momentum Investment Strate-
gies, Portfolio Performance, and Herding: A Study of Mutual Fund Behavior, American
Economic Review, 85(5), pp. 1088-1105.
[40] Grossmann A. and Morlet J. (1984), Decomposition of Hardy Functions into Square
Integrable Wavelets of Constant Shape, SIAM Journal of Mathematical Analysis, 15,
pp. 723-736.
[42] Harvey A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Fil-
ter, Cambridge University Press.
[43] Harvey A.C. and Trimbur T.M. (2003), General Model-Based Filters for Extracting
Cycles and Trends in Economic Time Series, Review of Economics and Statistics, 85(2),
pp. 244-255.
[44] Hastie T., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learn-
ing, second edition, Springer.
[46] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical
Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[47] Holt C.C. (1959), Forecasting Seasonals and Trends by Exponentially Weighted Mov-
ing Averages, ONR Research Memorandum, 52, reprinted in International Journal of
Forecasting, 2004, 20(1), pp. 5-10.
[48] Hong H. and Stein J.C. (1997), A Unified Theory of Underreaction, Momentum Trading and Overreaction in Asset Markets, NBER Working Paper, 6324.
[51] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible
Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.
[52] Kalman R.E. (1960), A New Approach to Linear Filtering and Prediction Problems,
Transactions of the ASME – Journal of Basic Engineering, 82(D), pp. 35-45.
[54] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[56] Macaulay F. (1931), The Smoothing of Time Series, National Bureau of Economic
Research.
[57] Mallat S.G. (1989), A Theory for Multiresolution Signal Decomposition: The Wavelet
Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(7), pp. 674-693.
[58] Mann H.B. (1945), Nonparametric Tests against Trend, Econometrica, 13(3), pp. 245-
259.
[60] Muth J.F. (1960), Optimal Properties of Exponentially Weighted Forecasts, Journal
of the American Statistical Association, 55(290), pp. 299-306.
[61] Oppenheim A.V. and Schafer R.W. (2009), Discrete-Time Signal Processing, third
edition, Prentice-Hall.
[62] Peña D. and Box, G.E.P. (1987), Identifying a Simplifying Structure in Time Series,
Journal of the American Statistical Association, 82(399), pp. 836-843.
[64] Pollock D.S.G. (2009), Statistical Signal Extraction: A Partial Survey, in Kon-
toghiorges E. and Belsley D.E. (eds.), Handbook of Empirical Econometrics, John Wiley
and Sons.
[65] Rao S.T. and Zurbenko I.G. (1994), Detecting and Tracking Changes in Ozone air
Quality, Journal of Air and Waste Management Association, 44(9), pp. 1089-1092.
[67] Savitzky A. and Golay M.J.E. (1964), Smoothing and Differentiation of Data by
Simplified Least Squares Procedures, Analytical Chemistry, 36(8), pp. 1627-1639.
[68] Silverman B.W. (1985), Some Aspects of the Spline Smoothing Approach to Non-
Parametric Regression Curve Fitting, Journal of the Royal Statistical Society, B47(1),
pp. 1-52.
[69] Sorenson H.W. (1970), Least-Squares Estimation: From Gauss to Kalman, IEEE
Spectrum, 7, pp. 63-68.
[70] Stock J.H. and Watson M.W. (1988), Variable Trends in Economic Time Series,
Journal of Economic Perspectives, 2(3), pp. 147-174.
[71] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Times
Series Forecasting, Neurocomputing, 48(1-4), pp. 847-861.
[72] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of
the Royal Statistical Society, B58(1), pp. 267-288.
[73] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.
Disclaimer
Each of this material and its content is confidential and may not be reproduced or provided
to others without the express written permission of Lyxor Asset Management (“Lyxor AM”).
This material has been prepared solely for informational purposes only and it is not intended
to be and should not be considered as an offer, or a solicitation of an offer, or an invitation
or a personal recommendation to buy or sell participating shares in any Lyxor Fund, or
any security or financial instrument, or to participate in any investment strategy, directly
or indirectly.
It is intended for use only by those recipients to whom it is made directly available by Lyxor
AM. Lyxor AM will not treat recipients of this material as its clients by virtue of their
receiving this material.
This material reflects the views and opinions of the individual authors at this date and in
no way the official position or advice of any kind of these authors or of Lyxor AM, and thus
does not engage the responsibility of Lyxor AM or of any of its officers or employees. All
performance information set forth herein is based on historical data and, in some cases, hy-
pothetical data, and may reflect certain assumptions with respect to fees, expenses, taxes,
capital charges, allocations and other factors that affect the computation of the returns.
Past performance is not necessarily a guide to future performance. While the information
(including any historical or hypothetical returns) in this material has been obtained from
external sources deemed reliable, neither Société Générale (“SG”), Lyxor AM, nor their
affiliates, officers or employees guarantee its accuracy, timeliness or completeness. Any opinions
expressed herein are statements of our judgment on this date and are subject to change
without notice. SG, Lyxor AM and their affiliates assume no fiduciary responsibility or liability
for any consequences, financial or otherwise, arising from an investment in any security or
financial instrument described herein or in any other security, or from the implementation
of any investment strategy.
Lyxor AM and its affiliates may from time to time deal in, profit from the trading of, hold,
have positions in, or act as market makers, advisers, brokers or otherwise in relation to the
securities and financial instruments described herein.
Service marks appearing herein are the exclusive property of SG and its affiliates, as the
case may be.
This material is communicated by Lyxor Asset Management, which is authorized and reg-
ulated in France by the “Autorité des Marchés Financiers” (French Financial Markets Au-
thority).
© 2011 Lyxor Asset Management. All rights reserved.
The Lyxor White Paper Series is a quarterly publication providing our
clients access to intellectual capital, risk analytics and quantitative
research developed within Lyxor Asset Management. The Series
covers in-depth studies of investment strategies, asset allocation
methodologies and risk management techniques. We hope you will
find the Lyxor White Paper Series stimulating and interesting.
PUBLISHING DIRECTORS
Alain Dubois, Chairman of the Board
Laurent Seyer, Chief Executive Officer
EDITORIAL BOARD
Nicolas Gaussel, PhD, Managing Editor
Thierry Roncalli, PhD, Associate Editor
Benjamin Bruder, PhD, Associate Editor