
University of Paris 7 - Lyxor Asset Management

Master thesis

Momentum Strategies:
From novel Estimation Techniques to
Financial Applications

Author: Supervisor:
Tung-Lam Dao Prof. Thierry Roncalli

September 30, 2011

Contents

Acknowledgments ix

Confidential notice xi

Introduction xiii

1 Trading Strategies with L1 Filtering 1


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 L1 filtering schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Application to trend-stationary process . . . . . . . . . . . . 3
1.3.2 Extension to mean-reverting process . . . . . . . . . . . . . . 4
1.3.3 Mixing trend and mean-reverting properties . . . . . . . . . . 8
1.3.4 How to calibrate the regularization parameters? . . . . . . . . 8
1.4 Application to momentum strategies . . . . . . . . . . . . . . . . . . 13
1.4.1 Estimating the optimal filter for a given trading date . . . . . 13
1.4.2 Backtest of a momentum strategy . . . . . . . . . . . . . . . . 15
1.5 Extension to the multivariate case . . . . . . . . . . . . . . . . . . . 16
1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Volatility Estimation for Trading Strategies 21


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Range-based estimators of volatility . . . . . . . . . . . . . . . . . . 22
2.2.1 Range based daily data . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Basic estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 High-low estimators . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.4 How to eliminate both drift and opening effects? . . . . . . . 28
2.2.5 Numerical simulations . . . . . . . . . . . . . . . . . . . . . . 29
2.2.6 Backtest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Estimation of realized volatility . . . . . . . . . . . . . . . . . . . . . 42
2.3.1 Moving-average estimator . . . . . . . . . . . . . . . . . . . . 42
2.3.2 IGARCH estimator . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.3 Extension to range-based estimators . . . . . . . . . . . . . . 45
2.3.4 Calibration procedure of the estimators of realized volatility . 45
2.4 High-frequency volatility estimators . . . . . . . . . . . . . . . . . . . 50

2.4.1 Microstructure effect . . . . . . . . . . . . . . . . . . . . . . . 52
2.4.2 Two time-scale volatility estimator . . . . . . . . . . . . . . . 52
2.4.3 Numerical implementation and backtesting . . . . . . . . . . 55
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3 Support Vector Machine in Finance 59


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Support vector machine at a glance . . . . . . . . . . . . . . . . . . . 60
3.2.1 Basic ideas of SVM . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.2 ERM and VRM frameworks . . . . . . . . . . . . . . . . . . . 65
3.3 Numerical implementations . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.1 Dual approach . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Primal approach . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.3.3 Model selection - Cross validation procedure . . . . . . . . . . 76
3.4 Extension to SVM multi-classification . . . . . . . . . . . . . . . . . 77
3.4.1 Basic idea of multi-classification . . . . . . . . . . . . . . . . . 77
3.4.2 Implementations of multiclass SVM . . . . . . . . . . . . . . . 78
3.5 SVM-regression in finance . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5.1 Numerical tests on SVM-regressors . . . . . . . . . . . . . . . 83
3.5.2 SVM-Filtering for forecasting the trend of signal . . . . . . . 84
3.5.3 SVM for multivariate regression . . . . . . . . . . . . . . . . . 87
3.6 SVM-classification in finance . . . . . . . . . . . . . . . . . . . . . . 91
3.6.1 Test of SVM-classifiers . . . . . . . . . . . . . . . . . . . . . . 91
3.6.2 SVM for classification . . . . . . . . . . . . . . . . . . . . . . 95
3.6.3 SVM for score construction and stock selection . . . . . . . . 98
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

4 Analysis of Trading Impact in the CTA strategy 109


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Conclusions 113

A Appendix of chapter 1 115


A.1 Computational aspects of L1 , L2 filters . . . . . . . . . . . . . . . . . 115
A.1.1 The dual problem . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.1.2 The interior-point algorithm . . . . . . . . . . . . . . . . . . . 117
A.1.3 The scaling of smoothing parameter of L1 filter . . . . . . . . 118
A.1.4 Calibration of the L2 filter . . . . . . . . . . . . . . . . . . . . 119
A.1.5 Implementation issues . . . . . . . . . . . . . . . . . . . . . . 121

B Appendix of chapter 2 123


B.1 Estimator of volatility . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B.1.1 Estimation with realized return . . . . . . . . . . . . . . . . . 123

C Appendix of chapter 3 125
C.1 Dual problem of SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 125
C.1.1 Hard-margin SVM classifier . . . . . . . . . . . . . . . . . . . 125
C.1.2 Soft-margin SVM classifier . . . . . . . . . . . . . . . . . . . . 126
C.1.3 ε-SV regression . . . . . . . . . . . . . . . . . . . . . . . . . . 127
C.2 Newton optimization for the primal problem . . . . . . . . . . . . . . 128
C.2.1 Quadratic loss function . . . . . . . . . . . . . . . . . . . . . 128
C.2.2 Soft-margin SVM . . . . . . . . . . . . . . . . . . . . . . . . . 129

Published paper 131

List of Figures

1.1 L1 − T filtering versus HP filtering for the model (1.2) . . . . . . . . 5


1.2 L1 -T filtering versus HP filtering for the model (1.3) . . . . . . . . . 5
1.3 L1 − C filtering versus HP filtering for the model (1.5) . . . . . . . . 7
1.4 L1 − C filtering versus HP filtering for the model (1.6) . . . . . . . . 7
1.5 L1 − T C filtering versus HP filtering for the model (1.2) . . . . . . . 8
1.6 L1 − T C filtering versus HP filtering for the model (1.3) . . . . . . . 9
1.7 Influence of the smoothing parameter λ . . . . . . . . . . . . . . . . 10
1.8 Scaling power law of the smoothing parameter λmax . . . . . . . . . 11
1.9 Cross-validation procedure for determining optimal value λ? . . . . . 11
1.10 Calibration procedure with the S&P 500 index . . . . . . . . . . . . 13
1.11 Cross validation procedure for two-trend model . . . . . . . . . . . . 13
1.12 Comparison between different L1 filters on S&P 500 Index . . . . . . 14

2.1 Data set of 1 trading day . . . . . . . . . . . . . . . . . . . . . . . . 23


2.2 Volatility estimators without drift and opening effects (M = 50) . . . 30
2.3 Volatility estimators without drift and opening effect (M = 500) . . 31
2.4 Volatility estimators with µ = 30% and without opening effect (M = 500) 31
2.5 Volatility estimators with opening effect f = 0.3 and without drift
(M = 500) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Volatility estimators with correction of the opening jump (f = 0.3) . 32
2.7 Volatility estimators on stochastic volatility simulation . . . . . . . . 33
2.8 Test of voltarget strategy with stochastic volatility simulation . . . . 34
2.9 Test of voltarget strategy with stochastic volatility simulation . . . . 35
2.10 Comparison between different probability density functions . . . . . 36
2.11 Comparison between the different cumulative distribution functions . 36
2.12 Volatility estimators on S&P 500 index . . . . . . . . . . . . . . . . 37
2.13 Volatility estimators on BHI UN Equity . . . . . . . . . . . . . . . . 37
2.14 Estimation of the closing interval for S&P 500 index . . . . . . . . . 38
2.15 Estimation of the closing interval for BHI UN Equity . . . . . . . . . 38
2.16 Likelihood function for various estimators on S&P 500 . . . . . . . . 39
2.17 Likelihood function for various estimators on BHI UN Equity . . . . 40
2.18 Backtest of voltarget strategy on S&P 500 index . . . . . . . . . . . 41
2.19 Backtest of voltarget strategy on BHI UN Equity . . . . . . . . . . . 41
2.20 Comparison between IGARCH estimator and CC estimator . . . . . 46

2.21 Likelihood function of high-low estimators versus filtered parameter β 47
2.22 Likelihood function of high-low estimators versus effective moving window 48
2.23 IGARCH estimator versus moving-average estimator for close-to-close
prices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.24 Comparison between different IGARCH estimators for high-low prices 49
2.25 Daily estimation of the likelihood function for various close-to-close
estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.26 Daily estimation of the likelihood function for various high-low estimators 50
2.27 Backtest for close-to-close estimator and realized estimators . . . . . 51
2.28 Backtest for IGARCH high-low estimators comparing to IGARCH
close-to-close estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.29 Two-time scale estimator of intraday volatility . . . . . . . . . . . . . 56

3.1 Geometric interpretation of the margin in a linear SVM. . . . . . . . 61


3.2 Binary decision tree strategy for multiclassification problem . . . . . 80
3.3 L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.16) 84
3.4 L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.17) 85
3.5 Comparison of different regression kernel for model (3.16) . . . . . . 85
3.6 Comparison of different regression kernel for model (3.17) . . . . . . 86
3.7 Cross-validation procedure for determining optimal value C ? σ ? . . . 87
3.8 SVM-filtering with fixed horizon scheme . . . . . . . . . . . . . . . . 88
3.9 SVM-filtering with dynamic horizon scheme . . . . . . . . . . . . . . 88
3.10 L1 -regressor versus L2 -regressor with Gaussian kernel for model (3.16) 90
3.11 Comparison of different kernels for multivariate regression . . . . . . 90
3.12 Comparison between Dual algorithm and Primal algorithm . . . . . . 92
3.13 Illustration of non-linear classification with Gaussian kernel . . . . . 92
3.14 Illustration of multiclassification with SVM-BDT for in-sample data 93
3.15 Illustration of multiclassification with SVM-BDT for out-of-sample data 94
3.16 Illustration of multiclassification with SVM-BDT for  = 0 . . . . . . 94
3.17 Illustration of multiclassification with SVM-BDT for  = 0.2 . . . . . 95
3.18 Multiclassification with SVM-BDT on training set . . . . . . . . . . 96
3.19 Prediction efficiency with SVM-BDT on the validation set . . . . . . 97
3.20 Comparison between simulated score and Probit score for d = 2 . . . 101
3.21 Comparison between simulated score CDF and Probit score CDF for
d = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.22 Comparison between simulated score PDF and Probit score PDF for
d = 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.23 Selection curve for long strategy for simulated data and Probit model 103
3.24 Probit scores for Eurostoxx data with d = 20 factors . . . . . . . . . 104
3.25 SVM scores for Eurostoxx data with d = 20 factors . . . . . . . . . . 105

A.1 Spectral density of moving-average and L2 filters . . . . . . . . . . . 120


A.2 Relationship between the value of λ and the length of the moving-
average filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

List of Tables

1.1 Results for the Backtest . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1 Estimation error for various estimators . . . . . . . . . . . . . . . . . 34


2.2 Performance of σ̂²HL versus σ̂²CC for different averaging windows . . . 42
2.3 Performance of σ̂²HL versus σ̂²CC for different filters of f . . . . . . . 42

Acknowledgments

During six unforgettable months in the R&D team of Lyxor Asset Management, I have experienced and enjoyed every moment. Beyond all the professional experience that I gained from everyone in the department, I truly appreciated the great atmosphere in the team, which motivated me every day.

I would like first to thank Thierry Roncalli for his supervision during my stay in the team. I could never have imagined learning so many interesting things during my internship without his direction and his confidence. Thierry introduced me to the financial concepts of the asset management world in a very interactive way. I would say that I learnt finance in every single discussion with him. He taught me how to combine learning and practice. On the professional side, Thierry helped me to fill the gaps in my financial knowledge by allowing me to work on various interesting topics. He made me confident in presenting my understanding of this field. On the personal side, Thierry shared his own experience and also taught me how to adapt to this new world.

I would like to thank Nicolas Gaussel for his warm reception in the Quantitative Management department, for his confidence and for his encouragement during my stay at Lyxor. I had the chance to work with him on a very interesting topic concerning the CTA strategy, which plays an important role in asset management. I would like to thank Benjamin Bruder, my nearest neighbor, for his guidance and his supervision throughout my internship. Informally, Benjamin was almost my co-advisor. I must say that I owe him a lot for all his patience in our daily discussions, teaching me and working out the many questions that came up in my projects. I am really grateful for his sense of humor, which warmed up the atmosphere.

To all the members of the R&D team, I would like to express my gratitude for their help, their advice and everything they shared with me during my stay. I am really happy to have been one of them. Thank you Jean-Charles for your friendship, for all our daily discussions and for your support of all the initiatives in my projects. A great thank you to Stephane, who always cheered up the breaks with his intelligent humor. I would say that I have learnt from him the most interesting view of the “Binomial world”. Thank you Karl for explaining your macro world. Thank you Pierre for all your help with data collection and your passion in every explanation, such as the story of “Merrill Lynch's investment clock”. Thank you Zelia for a very stimulating collaboration on my last project and for the great time during our internships.
To everyone on the other side of the room, I would like to thank Philippe Balthazard for his comments on my projects and his point of view on financial matters. Thank you Hoang-Phong Nguyen for your help with the database and your support during my stay. There are many other people with whom I had the chance to interact but whom I cannot cite here.

Thanks to my parents and my sister, who always believed in me and supported me during my change to a new direction. Finally, I would like to reserve the greatest thanks for my wife and my son, for their love and daily encouragement. They were always behind me during the most difficult moments of this year.

Confidential notice

This thesis is subject to confidential research carried out in the R&D team of Lyxor Asset Management. It is divided into two main parts. The first part, comprising the first three chapters (1, 2 and 3), consists of applications of novel estimation techniques for the trend and the volatility of financial time series. We present the main results in detail together with a publication in the Lyxor White Paper series. The second part, concerning the analysis of the CTA performance in the risk-return framework (see The Lyxor White Paper Series, Issue #7, June 2011), is skipped due to confidentiality. Only a brief introduction and the final conclusion of this part (Chapter 4) are presented in order to sketch out its main features.

This document contains information confidential and proprietary to Lyxor Asset


Management. The information may not be used, disclosed or reproduced without the
prior written authorization of Lyxor Asset Management and those so authorized may
only use the information for the purpose of evaluation consistent with authorization.
Introduction

During the internship in the Research and Development team of Lyxor Asset Management, we studied novel techniques applicable to asset management. We focused on the analysis of some special classes of momentum strategies such as trend-following strategies and voltarget strategies. These strategies play a crucial role in quantitative management as they aim to optimize the profit based on exploitable signals of market inefficiency and to limit the market risk via an efficient control of the volatility.

The objectives of this report are two-fold. We first studied some novel techniques from statistics and signal processing such as trend filtering, daily and high-frequency volatility estimation and support vector machines. We employed these techniques to extract interesting financial signals. These signals are used to implement the momentum strategies described in detail in each chapter of this report. The second objective concerns the study of the performance of these strategies based on the general risk-return analysis framework (see B. Bruder and N. Gaussel, The Lyxor White Paper Series, Issue #7). This report is organized as follows:

In the first chapter, we discuss various implementations of L1 filtering in order to detect some properties of noisy signals. This filter consists of using an L1 penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter λ, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (named local trends). A combination of these two time scales can form a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.

We next review in the second chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data, then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices fluctuate with an additional noise, the so-called micro-structure noise. This effect comes from the bid-ask spread and the short time scale. Within a short time interval, the trading price does not reflect exactly the equilibrium price determined by supply and demand but bounces between the bid and ask prices. In the second part, we discuss the effect of the micro-structure noise on the volatility estimation. It is a very important topic concerning the large field of “high-frequency” trading. Examples of backtesting on an index and on stocks illustrate the efficiency of the considered techniques.

The third chapter is dedicated to the study of a general machine-learning framework. We review the well-known machine learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation according to Vapnik [1998]. Within the scope of this report, we first give an overview of this method and its numerical implementation variants, then bridge it to financial applications such as trend forecasting, stock selection, sector recognition or score construction.

We finish in Chapter 4 with the performance analysis of the CTA strategy. We first review trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We start the discussion with the case of a momentum strategy on a single asset, then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be demonstrated that the cumulated return of the strategy can be split into two important parts. The first one is called the “Option Profile”, which involves only the current measured trend. This idea is very similar in concept to the straddle profile suggested by Fung and Hsieh (2001). The second part is called the “Trading Impact”, which involves an integral of the measured trend over the trading period. We focus on the second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a “toy model”. This study reveals important results which can be directly tested on CTA funds.

Chapter 1

Trading Strategies with L1 Filtering

In this chapter, we discuss various implementations of L1 filtering in order to detect some properties of noisy signals. This filter consists of using an L1 penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter λ, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (named local trends). A combination of these two time scales can form a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.

Keywords: Momentum strategy, L1 filtering, L2 filtering, trend-following, mean-reverting.

1.1 Introduction
Trend detection is a major task of time series analysis from both the mathematical and the financial points of view. The trend of a time series is considered as the component containing the global change, in contrast to the local changes due to the noise. The procedure of trend filtering concerns not only the problem of denoising: it must also take into account the dynamics of the underlying process. That explains why mathematical approaches to trend extraction have a long history and why this subject is still of great interest to the scientific community¹. From an investment perspective, trend filtering is at the core of most momentum strategies developed in the asset management industry and the hedge fund community in order to improve performance and to limit the risk of portfolios.

¹ For a general review, see Alexandrov et al. (2008).


This chapter is organized as follows. In Section 1.2, we discuss the trend-cycle decomposition of time series and review general properties of L1 and L2 filtering. In Section 1.3, we describe the L1 filter with its various extensions and the calibration procedure. In Section 1.4, we apply L1 filters to some momentum strategies and present the results of some backtests with the S&P 500 index. In Section 1.5, we discuss the possible extension to the multivariate case, and we conclude in the last section.

1.2 Motivations
In economics, the trend-cycle decomposition plays an important role in describing a non-stationary time series in terms of permanent and transitory stochastic components. Generally, the permanent component is assimilated to a trend whereas the transitory component may be a noise or a stochastic cycle. Moreover, the literature on business cycles has produced a large amount of empirical research on this topic (see for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and Prescott (1997)). The last authors introduced a new method to estimate the trend of long-run GDP; this method, based on L2 filtering, is the one widely used by economists. Recently, Kim et al. (2009) have developed a similar filter by replacing the L2 penalty function by an L1 penalty function.

Let us consider a time series yt which can be decomposed into a slowly varying trend xt and a rapidly varying noise process εt:

$$y_t = x_t + \varepsilon_t$$

Let us first recall the well-known L2 filter (the so-called Hodrick-Prescott filter). This scheme consists in determining the trend xt by minimizing the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1}(x_{t-1} - 2x_t + x_{t+1})^2$$

with λ > 0 the regularization parameter which controls the competition between the smoothness of xt and the residual yt − xt (or the noise εt). We remark that the second term is the discrete second derivative of the trend xt, which characterizes the smoothness of the curve. Minimizing this objective function gives a solution which is a trade-off between fitting the data and the smoothness of its curvature. In finance, this scheme does not give a clear signature of the market tendency. By contrast, if we replace the L2 norm by the L1 norm in the objective function, we can obtain more interesting properties. Therefore, Kim et al. (2009) propose to consider the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$

This problem is closely related to the Lasso regression of Tibshirani (1996) or the L1-regularized least squares problem of Daubechies et al. (2004). Here, taking the L1 norm imposes the condition that the second derivative of the filtered signal must be zero except at a few points. Hence, the filtered signal is composed of a set of straight trends and breaks². The competition between the two terms of the objective function turns into a competition between the number of straight trends (or number of breaks) and the closeness to the raw data. Therefore, the smoothing parameter λ plays an important role in detecting the number of breaks. In what follows, we present briefly how the L1 filter works for trend detection and its extension to mean-reverting processes. The calibration procedure for the λ parameter will also be discussed in detail.

1.3 L1 filtering schemes


1.3.1 Application to trend-stationary process
The Hodrick-Prescott scheme discussed in the last section can be rewritten in the vector space $\mathbb{R}^n$ with its L2 norm $\|\cdot\|_2$ as:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda \|Dx\|_2^2$$

where $y = (y_1, \dots, y_n)$, $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ and the D operator is the $(n-2) \times n$ matrix:

$$D = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix} \qquad (1.1)$$

The exact solution of this estimation problem is given by:

$$x^\star = \left(I + 2\lambda D^\top D\right)^{-1} y$$

The explicit expression of x⋆ allows a very simple numerical implementation with sparse matrices. As the L2 filter is a linear filter, the regularization parameter λ can be calibrated by comparison with the usual moving-average filter. The details of the calibration procedure are given in Appendix A.1.4.
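As an illustration, the following sketch computes this closed-form solution with sparse matrices (a minimal example in Python; the function name and the use of scipy.sparse are our own choices, not taken from the thesis). Because the system is banded, the solve remains cheap even for long time series.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def hp_filter(y, lam):
    """L2 (Hodrick-Prescott) trend filter: x* = (I + 2*lam*D'D)^{-1} y."""
    n = len(y)
    # Second-difference operator D of size (n-2) x n, as in equation (1.1)
    D = sparse.diags([1.0, -2.0, 1.0], offsets=[0, 1, 2], shape=(n - 2, n))
    A = sparse.eye(n) + 2.0 * lam * (D.T @ D)
    return spsolve(A.tocsc(), np.asarray(y, dtype=float))

# Example: recover a smooth trend from a noisy random walk
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=1000))
trend = hp_filter(y, lam=1600.0)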

The idea of the L2 filter can be generalized to a larger class, the so-called Lp filters, by using an Lp penalty condition instead of the L2 penalty. This generalization is already discussed in the work of Daubechies et al. (2004) for the linear inverse problem or in the Lasso regression problem of Tibshirani (1996). If we consider an L1 filter, the objective function becomes:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$

² A break is a position where the trend of the signal changes.


which is equivalent to the following vector form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda \|Dx\|_1$$

It has been demonstrated in Kim et al. (2009) that the dual problem of this L1 filtering scheme is a quadratic program with some boundary constraints. The details of this derivation are shown in Appendix A.1.1. In order to optimize the numerical computation speed, we follow Kim et al. (2009) in using a “primal-dual interior point” method (see Appendix A.1.2). In the following, we check the efficiency of this technique on various trend-stationary processes.
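Readers who simply want to reproduce the L1−T filter can also rely on a generic convex solver instead of the interior-point code of Appendix A.1.2. The sketch below is such a variant (our own, based on the cvxpy modelling library, which is not used in the thesis):

import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam):
    """Minimize 0.5*||y - x||_2^2 + lam*||D x||_1, with D the second-difference operator."""
    x = cp.Variable(len(y))
    objective = cp.Minimize(0.5 * cp.sum_squares(np.asarray(y, dtype=float) - x)
                            + lam * cp.norm1(cp.diff(x, 2)))
    cp.Problem(objective).solve()
    return x.value

# Example on a noisy piecewise-linear signal (two straight trends, one break)
rng = np.random.default_rng(1)
x_true = np.concatenate([np.linspace(0, 50, 250), np.linspace(50, 20, 250)])
trend = l1_trend_filter(x_true + rng.normal(scale=5.0, size=500), lam=500.0)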

The first model consists of data simulated by a set of straight trend lines with a white noise perturbation:

$$\begin{cases} y_t = x_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ x_t = x_{t-1} + v_t \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.2)$$

We present in Figure 1.1 the comparison between the L1−T and HP filtering schemes³. The top-left graph is the real trend xt whereas the top-right graph presents the noisy signal yt. The bottom graphs show the results of the L1−T and HP filters. Here, we have chosen λ = 5 258 for the L1−T filtering and λ = 1 217 464 for the HP filtering. This choice of λ for the L1−T filtering is based on the number of breaks in the trend, which is fixed to 10 in this example⁴. The second model is a random walk generated by the following process:

$$\begin{cases} y_t = y_{t-1} + v_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.3)$$

We present in Figure 1.2 the comparison between the L1−T filtering and the HP filtering on this second model⁵.

1.3.2 Extension to mean-reverting process


As shown in the last paragraph, the use of an L1 penalty on the second derivative gives a correct description of the signal tendency. Hence, a similar idea can be applied to other orders of the derivative. We present here the extension of this L1 filtering technique to the case of mean-reverting processes.
³ We consider n = 2 000 observations. The parameters of the simulation are p = 0.99, b = 0.5 and σ = 15.
⁴ We discuss how to obtain λ in the next section.
⁵ The parameters of the simulation are p = 0.993, b = 5 and σ = 15.


Figure 1.1: L1−T filtering versus HP filtering for the model (1.2)
(panels: signal, noisy signal, L1-T filter, HP filter)

Figure 1.2: L1−T filtering versus HP filtering for the model (1.3)
(panels: signal, noisy signal, L1-T filter, HP filter)


If we now impose the L1 penalty condition on the first derivative, we can expect to get a fitted signal with zero slope. The cost of this penalty is proportional to the number of jumps. In this case, we would like to minimize the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=2}^{n}\left|x_t - x_{t-1}\right|$$

or, in vector form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda \|Dx\|_1$$
Here the D operator is the $(n-1) \times n$ matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \qquad (1.4)$$
We may apply the same minimization algorithm as previously (see Appendix A.1.1). To illustrate this, we consider a model with step trend lines perturbed by a white noise process:

$$\begin{cases} y_t = x_t + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.5)$$
We employ this model to test the L1−C filtering and the HP filtering adapted to the first derivative⁶, which corresponds to the following optimization program:

$$\min \; \frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda \sum_{t=2}^{n}(x_t - x_{t-1})^2$$

In Figure 1.3, we have reported the corresponding results⁷. For the second test, we consider a mean-reverting process (Ornstein-Uhlenbeck process) whose mean value follows a regime-switching process:

$$\begin{cases} y_t = y_{t-1} + \theta\,(x_t - y_{t-1}) + \varepsilon_t, & \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.6)$$

Here, xt is the process which characterizes the mean value and θ is inversely proportional to the return time to the mean. In Figure 1.4, we show how the L1−C filter captures the original signal in comparison with the HP filter⁸.
⁶ We use the term HP filter in order to keep homogeneous notations. However, we notice that this filter is indeed the FLS filter proposed by Kalaba and Tesfatsion (1989) when the exogenous regressors are only a constant.
⁷ The parameters are p = 0.998, b = 50 and σ = 8.
⁸ For the simulation of the Ornstein-Uhlenbeck process, we have chosen p = 0.9985, b = 20, θ = 0.1 and σ = 2.


Figure 1.3: L1−C filtering versus HP filtering for the model (1.5)
(panels: signal, noisy signal, L1-C filter, HP filter)

Figure 1.4: L1−C filtering versus HP filtering for the model (1.6)
(panels: signal, noisy signal, L1-C filter, HP filter)


1.3.3 Mixing trend and mean-reverting properties


We now combine the two schemes proposed above. In this case, we define two regularization parameters λ₁ and λ₂ corresponding to the two penalty conditions $\sum_{t=2}^{n}|x_t - x_{t-1}|$ and $\sum_{t=2}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|$. Our objective function for the primal problem now becomes:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda_1 \sum_{t=2}^{n}\left|x_t - x_{t-1}\right| + \lambda_2 \sum_{t=2}^{n-1}\left|x_{t-1} - 2x_t + x_{t+1}\right|$$

which can again be rewritten in matrix form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda_1 \|D_1 x\|_1 + \lambda_2 \|D_2 x\|_1$$

where the D₁ and D₂ operators are respectively the (n−1) × n and (n−2) × n matrices defined in equations (1.4) and (1.1).
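A minimal sketch of this mixed L1−TC objective with a generic convex solver (again our own cvxpy-based variant, not the interior-point implementation used in the thesis):

import cvxpy as cp
import numpy as np

def l1_tc_filter(y, lam1, lam2):
    """Mixed filter: L1 penalties on both the first and second differences of x."""
    x = cp.Variable(len(y))
    objective = cp.Minimize(
        0.5 * cp.sum_squares(np.asarray(y, dtype=float) - x)
        + lam1 * cp.norm1(cp.diff(x, 1))   # step (mean-reverting) penalty
        + lam2 * cp.norm1(cp.diff(x, 2))   # trend penalty
    )
    cp.Problem(objective).solve()
    return x.value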

In Figures 1.5 and 1.6, we test the efficiency of the mixing scheme on the straight
trend lines model (1.2) and the random walk model (1.3)9 .

Figure 1.5: L1−TC filtering versus HP filtering for the model (1.2)
(panels: signal, noisy signal, L1-TC filter, HP filter)

⁹ For both models, the parameters are p = 0.99, b = 0.5 and σ = 5.

1.3.4 How to calibrate the regularization parameters?


Figure 1.6: L1−TC filtering versus HP filtering for the model (1.3)
(panels: signal, noisy signal, L1-TC filter, HP filter)

As shown above, the trend obtained from L1 filtering depends on the parameter λ of the regularization procedure. For large values of λ, we obtain the long-term trend of the data, while for small values of λ, we obtain short-term trends of the data. In this paragraph, we attempt to define a procedure which permits the right choice of the smoothing parameter according to our need for trend extraction.

A preliminary remark

For small values of λ, we recover the original form of the signal. For large values of λ, we remark that there exists a maximum value λmax above which the trend signal has the affine form:

$$x_t = \alpha + \beta t$$

where α and β are two constants which do not depend on the time t. The value of λmax is given by:

$$\lambda_{\max} = \left\| \left(D D^\top\right)^{-1} D y \right\|_\infty$$

We can use this remark to get an idea of the order of magnitude of λ which should be used to determine the trend over a certain time period T. To illustrate this idea, we take the data over the total period T. If we want the global trend over this period, we fix λ = λmax. This λ gives a unique trend for the signal over the whole period. If one needs more detail on the trend over shorter periods, we can divide the signal into p time intervals and then estimate λ


via the mean value of all the λ^i_max parameters:

$$\lambda = \frac{1}{p}\sum_{i=1}^{p} \lambda_{\max}^{i}$$

In Figure 1.7, we show the results obtained with p = 2 (λ = 1 500) and p = 6 (λ = 75) on the S&P 500 index.
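A small sketch of this rule of thumb, assuming the λmax formula above (the helper names are ours):

import numpy as np

def second_difference_matrix(n):
    """Dense (n-2) x n second-difference operator D."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def lambda_max(y):
    """lambda_max = || (D D')^{-1} D y ||_inf for the L1-T filter."""
    y = np.asarray(y, dtype=float)
    D = second_difference_matrix(len(y))
    return np.max(np.abs(np.linalg.solve(D @ D.T, D @ y)))

def lambda_from_subintervals(y, p):
    """Average lambda_max over p consecutive sub-intervals of the signal."""
    chunks = np.array_split(np.asarray(y, dtype=float), p)
    return np.mean([lambda_max(c) for c in chunks])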

Figure 1.7: Influence of the smoothing parameter λ
(S&P 500 index, 2007–2011; filtered trends for the two values of λ)

Moreover, the explicit calculation for a Brownian motion process gives us the scaling law of the smoothing parameter λmax. For the trend filtering scheme, λmax scales as T^(5/2) while for the mean-reverting scheme, λmax scales as T^(3/2) (see Figure 1.8). Numerical calculation of these powers over 500 simulations of the model (1.3) gives very good agreement with the analytical result for Brownian motion. Indeed, we obtain empirically that the power for the L1−T filter is 2.51 while the one for the L1−C filter is 1.52.

Cross validation procedure


In this paragraph, we discuss how to employ a cross-validation scheme in order to calibrate the smoothing parameter λ of our model. We define two additional parameters which characterize the trend detection mechanism. The first parameter T1 is the width of the data window used to estimate the optimal λ with respect to our target strategy. This parameter controls the precision of our calibration. The second parameter T2 is used to estimate the prediction error of the trends obtained in the main window. This parameter characterizes the time horizon of the investment strategy.

Figure 1.8: Scaling power law of the smoothing parameter λmax

Figure 1.9 shows how the data set is divided into different windows in the cross-validation procedure.

Figure 1.9: Cross-validation procedure for determining the optimal value λ⋆
(diagram: a training window of width T1 over the historical data, an adjacent test window of width T2 ending today, and a forecasting window of width T2 after today)

In order to get the optimal parameter λ, we compute the total error after scanning the whole data set with the window T1. The algorithm of this calibration process is described as follows:


Algorithm 1 Cross-validation procedure for L1 filtering

procedure CV_Filter(T1, T2)
    Divide the historical data into m rolling test sets T2^i (i = 1, ..., m)
    For each test window T2^i, compute the statistic λ^i_max
    From the array of λ^i_max, compute the average λ̄ and the standard deviation σ_λ
    Compute the boundaries λ1 = λ̄ − 2σ_λ and λ2 = λ̄ + 2σ_λ
    for j = 1 : n do
        Compute λ_j = λ1 (λ2/λ1)^(j/n)
        Divide the historical data into p rolling training sets T1^k (k = 1, ..., p)
        for k = 1 : p do
            For each training window T1^k, run the L1 filter
            Forecast the trend for the adjacent test window T2^k
            Compute the error e^k(λ_j) on the test window T2^k
        end for
        Compute the total error e(λ_j) = Σ_k e^k(λ_j)
    end for
    Minimize the total error e(λ) to find the optimal value λ⋆
    Run the L1 filter with λ = λ⋆
end procedure
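A compact Python transcription of this procedure (a simplified variant of ours: the trend on each test window is forecast by linearly extrapolating the last fitted slope, and l1_trend_filter and lambda_max refer to the helpers sketched earlier in this chapter):

import numpy as np

def cv_lambda(y, T1, T2, n_grid=15):
    """Grid-search the L1-T smoothing parameter by rolling-window cross-validation."""
    y = np.asarray(y, dtype=float)
    # Candidate grid built around the lambda_max statistics of rolling test windows
    lmax = [lambda_max(y[s:s + T2]) for s in range(0, len(y) - T2, T2)]
    lo = max(np.mean(lmax) - 2 * np.std(lmax), 1e-3)
    hi = np.mean(lmax) + 2 * np.std(lmax)
    grid = lo * (hi / lo) ** (np.arange(1, n_grid + 1) / n_grid)

    errors = []
    for lam in grid:
        err = 0.0
        for start in range(0, len(y) - T1 - T2, T2):
            train = y[start:start + T1]
            test = y[start + T1:start + T1 + T2]
            x = l1_trend_filter(train, lam)
            slope = x[-1] - x[-2]                       # last fitted slope
            forecast = x[-1] + slope * np.arange(1, T2 + 1)
            err += np.mean((test - forecast) ** 2)      # prediction error on the test window
        errors.append(err)
    return grid[int(np.argmin(errors))]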

Figure 1.10 illustrates the calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50 (the number of observations is equal to 1 008 trading days). With m = p = 12 and n = 15, the estimated optimal value λ⋆ for the L1−T filter is equal to 7.03.
We have observed that this calibration procedure is more favorable for a long-term time horizon, that is, for estimating a global trend. For a short-term time horizon, the prediction of local trends is much more perturbed by the noise. We have computed the probability of a good prediction of the market tendency for long-term and short-term time horizons. This probability is about 70% for a 3-month time horizon while it is just 50% for a one-week time horizon. It follows that even if the fit is good in-sample, the noise is nevertheless so large that the prediction of the future tendency is just 1/2 for an increasing market and 1/2 for a decreasing market. In order to obtain better results for smaller time horizons, we improve the last algorithm by proposing a two-trend model. The first trend is the local one, which is determined by the first algorithm with the parameter T2 corresponding to the local prediction. The second trend is the global one, which gives the tendency of the market over a longer period T3. The choice of this global trend parameter is very similar to the choice of the moving-average parameter. This model can be considered as a simple version of a mean-reverting model for the trend. In Figure 1.11, we describe how the data set is divided for estimating the local trend and the global trend.
The procedure for estimating the trend of the signal in the two-trend model is summarized in Algorithm 2. The corrected trend is now determined by studying the relative position of the historical data with respect to the global trend. The reference position is characterized by the standard deviation σ(yt − x^G_t), where x^G_t is the filtered global trend.

Figure 1.10: Calibration procedure with the S&P 500 index
(panels: S&P 500 index over 2007–2011 and the total error e(λ) as a function of ln λ)

1.4 Application to momentum strategies


In this section, we apply the previous framework to the S&P 500 index. First, we
illustrate the calibration procedure for a given trading date. Then, we backtest a
momentum strategy by estimating dynamically the optimal filters.

1.4.1 Estimating the optimal filter for a given trading date


We would like to estimate the optimal filter for January 3rd, 2011 by considering the period from January 2007 to December 2010.

Figure 1.11: Cross-validation procedure for the two-trend model
(diagram: the global trend is estimated and forecast over windows of width T3, while the local trend uses windows of width T2 within the training set T1, both relative to today and the prediction window)


Algorithm 2 Prediction procedure for the two-trend model

procedure Predict_Filter(T_l, T_g)
    Compute the local trend x^L_t for the time horizon T2 with the CV_Filter procedure
    Compute the global trend x^G_t for the time horizon T3 with the CV_Filter procedure
    Compute the standard deviation σ(y_t − x^G_t) of the data with respect to the global trend
    if y_t − x^G_t < σ(y_t − x^G_t) then
        Prediction ← x^L_t
    else
        Prediction ← x^G_t
    end if
end procedure

We use the previous algorithms with T1 = 400 and T2 = 50. The optimal parameters are λ1 = 2.46 (for the L1−C filter) and λ2 = 15.94 (for the L1−T filter). Results are reported in Figure 1.12. The trend for the next 50 trading days is estimated at 7.34% for the L1−T filter and 7.84% for the HP filter, whereas it is null for the L1−C and L1−TC filters. By comparison, the true performance of the S&P 500 index is 1.90% from January 3rd, 2011 to March 15th, 2011¹⁰.

Figure 1.12: Comparison between different L1 filters on S&P 500 Index

¹⁰ It corresponds exactly to a period of 50 trading days.


1.4.2 Backtest of a momentum strategy


Design of the strategy
Let us consider a class of self-financed strategies on a risky asset St and a risk-free asset Bt. We assume that the dynamics of these assets are:

$$\begin{aligned} dB_t &= r_t B_t \, dt \\ dS_t &= \mu_t S_t \, dt + \sigma_t S_t \, dW_t \end{aligned}$$

where rt is the risk-free rate, µt is the trend of the asset price and σt is the volatility. We denote by αt the proportion invested in the risky asset and by (1 − αt) the part invested in the risk-free asset. We start with an initial budget W0 and expect a final wealth WT. The optimal strategy is the one which maximizes the expectation of the utility function U(WT), which is increasing and concave. It is equivalent to the Markowitz problem, which consists of maximizing the wealth of the portfolio under a penalty for risk:

$$\sup_{\alpha \in \mathbb{R}} \; \mathbb{E}\left(W_T^\alpha\right) - \frac{\lambda}{2}\,\sigma^2\left(W_T^\alpha\right)$$

which is equivalent to:

$$\sup_{\alpha \in \mathbb{R}} \; \alpha_t \mu_t - \frac{\lambda}{2}\, W_0\, \alpha_t^2 \sigma_t^2$$

As the objective function is concave, the maximum corresponds to the zero of the gradient $\mu_t - \lambda W_0 \alpha_t \sigma_t^2$. We obtain the optimal solution:

$$\alpha_t^\star = \frac{1}{\lambda W_0}\,\frac{\mu_t}{\sigma_t^2}$$

In order to limit the explosion of αt, we also impose the constraint αmin ≤ αt ≤ αmax:

$$\alpha_t^\star = \max\left(\min\left(\frac{1}{\lambda W_0}\,\frac{\mu_t}{\sigma_t^2}, \alpha_{\max}\right), \alpha_{\min}\right)$$

The wealth of the portfolio is then given by the following expression:

$$W_{t+1} = W_t + W_t\left(\alpha_t^\star\left(\frac{S_{t+1}}{S_t} - 1\right) + \left(1 - \alpha_t^\star\right) r_t\right)$$
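A minimal sketch of this allocation rule and wealth recursion (Python, with our own variable names; the trend and volatility estimates are assumed to be given as arrays):

import numpy as np

def momentum_backtest(prices, mu_hat, sigma_hat, r=0.0, lam=1.0,
                      alpha_min=-1.0, alpha_max=1.0, w0=100.0):
    """Wealth path of the self-financed strategy with capped allocation alpha_t."""
    wealth = [w0]
    for t in range(len(prices) - 1):
        # Optimal allocation mu / (lam * W0 * sigma^2), clipped to [alpha_min, alpha_max]
        alpha = np.clip(mu_hat[t] / (lam * w0 * sigma_hat[t] ** 2), alpha_min, alpha_max)
        asset_ret = prices[t + 1] / prices[t] - 1.0
        wealth.append(wealth[-1] * (1.0 + alpha * asset_ret + (1.0 - alpha) * r))
    return np.array(wealth)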

Results
In the following simulations, we use the estimators µ̂t and σ̂t in place of µt and σt. For µ̂t, we consider different models such as the L1, HP and moving-average filters¹¹, whereas we use the following estimator for the volatility:

$$\hat{\sigma}_t^2 = \frac{1}{T}\int_{t-T}^{t} \sigma_u^2 \, du = \frac{1}{T}\sum_{i=t-T+1}^{t} \ln^2 \frac{S_i}{S_{i-1}}$$

We consider a long/short strategy, that is (αmin, αmax) = (−1, 1). In the particular case of the µ̂^L1_t estimator, we consider three different models:

¹¹ We denote them respectively by µ̂^L1_t, µ̂^HP_t and µ̂^MA_t.


Table 1.1: Results for the Backtest

Model          Performance   Volatility   Sharpe   IR     Drawdown
S&P 500        2.04%         21.83%       −0.06           56.78
µ̂ MA           3.13%         18.27%       −0.01    0.03   33.83
µ̂ HP           6.39%         18.28%        0.17    0.13   39.60
µ̂ L1 (LT)      3.17%         17.55%       −0.01    0.03   25.11
µ̂ L1 (GT)      6.95%         19.01%        0.19    0.14   31.02
µ̂ L1 (LGT)     6.47%         18.18%        0.17    0.13   31.99

1. the first one is based on the local trend;

2. the second one is based on the global trend;

3. the combination of both local and global trends corresponds to the third model.

For all these strategies, the test set of the local trend T2 is equal to 6 months (or 130 trading days), whereas the length of the test set for the global trend is four times larger, T3 = 4T2, that is 520 trading days. This choice of T3 agrees with the usual choice of the window width in the moving-average estimator. The length of the training set T1 is also four times the length of the test set. The study period runs from January 1998 to December 2010. In the backtest, the trend estimation is updated every day. In Table 1.1, we summarize the results obtained with the different models cited above. We remark that the best performances correspond to the global trend, HP and two-trend models. Because the HP filter is calibrated to the window of the moving-average filter, which is equal to T3, it is not surprising that the performances of these three models are similar. Over the considered backtest period, the S&P 500 does not have a clear upward or downward trend. Hence, the local trend estimator does not give a good prediction and this strategy gives the worst performance. By contrast, the two-trend model takes into account the trade-off between the local trend and the global trend and gives a better result.

1.5 Extension to the multivariate case


 
We now extend the L1 filtering scheme to a multivariate time series $y_t = (y_t^{(1)}, \dots, y_t^{(m)})$. The underlying idea is to estimate the common trend of several univariate time series. In finance, the time series correspond to the prices of several assets. Therefore, we can build long/short strategies between these assets by comparing the individual trends and the common trend.

For the sake of simplicity, we assume that all the signals are rescaled to the same order of magnitude¹². The objective function now becomes:

$$\frac{1}{2}\sum_{i=1}^{m} \left\| y^{(i)} - x \right\|_2^2 + \lambda \|Dx\|_1$$

In Appendix A.1.1, we show that this problem is equivalent to the univariate L1 problem obtained by taking the average signal $\bar{y}_t = m^{-1}\sum_{i=1}^{m} y_t^{(i)}$ as the input.

1.6 Conclusion
Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this chapter, we show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. A more sophisticated model based on local and global trends is also discussed. We remark that this model can reflect the effect of mean-reversion towards the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain results that are competitive with the traditional moving-average filter.

¹² For example, we may center and standardize the time series by subtracting the mean and dividing by the standard deviation.

Bibliography

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.

[2] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decompo-
sition of Economic Time Series into Permanent and Transitory Components
with Particular Attention to Measurement of the Business Cycle, Journal of
Monetary Economics, 7(2), pp. 151-174.

[3] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge Uni-


versity Press.

[4] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Se-
ries: A Model for the Census X-11 Program, Journal of the American Statistical
Association, 71(355), pp. 581-587.

[5] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding


Algorithm for Linear Inverse Problems with a Sparsity Constraint, Communi-
cations on Pure and Applied Mathematics, 57(11), pp. 1413-1457.

[6] Hastie T., Tibshirani R. and Friedman J. (2009), The Elements of Statistical Learning, Second Edition, Springer.

[7] Harvey A. (1991), Forecasting, Structural Time Series Models and the Kalman
Filter, Cambridge University Press.

[8] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An
Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.

[9] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via


Flexible Least Squares, Computers & Mathematics with Applications, 17, pp.
1215-1245.

[10] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), `1 Trend Filtering,
SIAM Review, 51(2), pp. 339-360.

[11] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Jour-
nal of the Royal Statistical Society B, 58(1), pp. 267-288.

Chapter 2

Volatility Estimation for Trading Strategies

We review in this chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data, then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices fluctuate with an additional noise, the so-called micro-structure noise. This effect comes from the bid-ask bounce due to the short time scale: within a short time interval, the trading price does not converge to the equilibrium price determined by the “supply-demand” equilibrium. In the second part, we discuss the effect of the micro-structure noise on the volatility estimation. It is a very important topic concerning the enormous field of “high-frequency” trading. Examples of backtesting on an index and on stocks will illustrate the efficiency of the considered techniques.

Keywords: Volatility, voltarget strategy, range-based estimator, high-low estimator, microstructure noise.

2.1 Introduction
Measuring the volatility is one of the most important questions in finance. As its name indicates, volatility is a direct measurement of the risk of a given asset. Under the hypothesis that the realized return follows a Brownian motion, volatility is usually estimated by the standard deviation of the daily price movements. As this assumption relates the stock price to the most common object of stochastic calculus, much mathematical work has been carried out on volatility estimation. With the increasing amount of trading data, we can exploit more and more useful information in order to improve the precision of the volatility estimator. A new class of estimators based on the high and low prices was invented. However, in the real world the asset price is not just a simple geometric Brownian motion: different effects have been observed, including the drift and the opening jump. A general correction scheme based on the combination of various estimators has been studied in order to eliminate these effects.

As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit¹, new phenomena due to the non-equilibrium of the market emerge and spoil the precision. This is called the micro-structure noise, which is characterized by the bid-ask bounce and the transaction effect. Because of this noise, the realized variance estimator overestimates the true volatility of the price process. An approach based on the use of two different time scales aims to eliminate this effect.

This chapter is organized as follows. In Section 2.2, we review the basic volatility estimator using the variance of the realized return (following a note by B. Bruder), then we introduce the variations based on range estimation. In Section 2.3, we discuss how to measure the instantaneous volatility and the lag effect induced by moving averages. In Section 2.4, we discuss the effect of the microstructure noise on high-frequency volatility.

¹ This limit defines the optimal frequency for the classical estimator; it is more or less agreed to be one trade every five minutes.

2.2 Range-based estimators of volatility


2.2.1 Range based daily data
In this paragraph, we discuss the general characteristics of the asset price and introduce the basic notations which will be used for the rest of the chapter. Let us assume that the dynamics of the asset price follow the usual Black-Scholes model. We denote by St the asset price, which follows a geometric Brownian motion in continuous time:

$$\frac{dS_t}{S_t} = \mu_t \, dt + \sigma_t \, dB_t \qquad (2.1)$$

Here, µt is the return (or drift) of the process whereas σt is the volatility. Over a period of T = 1 trading day, the evolution is divided into two time intervals: the first interval, with ratio f, describes the closing interval (before the opening), and the second interval, with ratio 1 − f, describes the opening interval (trading interval). In the monitored data, the closing interval is unobservable: its measure is not given by the real closing time but by the jump at the opening of the market. If the logarithm of the price follows a standard Brownian motion without drift, then the fraction f/(1 − f) is given by the square of the ratio between the standard deviation of the opening jump and that of the daily price movement. We will see that this idea gives a first correction due to the close-open effect for all the estimators discussed below.
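A small sketch of this correction, assuming the relation f/(1 − f) = var(o)/var(c) with the opening jumps o and open-to-close moves c defined just below (Python, our own helper name):

import numpy as np

def closing_fraction(open_, close):
    """Estimate the closing-period fraction f from
    f / (1 - f) = Var(opening jump) / Var(open-to-close move)."""
    open_, close = np.asarray(open_, dtype=float), np.asarray(close, dtype=float)
    o = np.log(open_[1:]) - np.log(close[:-1])   # opening jumps o_t
    c = np.log(close[1:]) - np.log(open_[1:])    # open-to-close moves c_t
    ratio = np.var(o) / np.var(c)
    return ratio / (1.0 + ratio)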
Figure 2.1: Data set of 1 trading day

In order to fix the notation, we define here the different quantities concerning the statistics of the price evolution:

• T is the time interval of one trading day

• f is the fraction of closing period

• σ̂²t is the estimator of the variance σ²t

• Oti is the opening price on a given period [ti , ti+1 [

• Cti is the closing price on a given period [ti , ti+1 [

• Hti = maxt∈[ti ,ti+1 [ St is the highest price on a given period [ti , ti+1 [

• Lti = mint∈[ti ,ti+1 [ St is the lowest price on a given period [ti , ti+1 [

• oti = ln Oti − ln Cti−1 is the opening jump

• uti = ln Hti − ln Oti is the highest price movement during the trading open

• dti = ln Lti − ln Oti is the lowest price movement during the trading open

• cti = ln Cti − ln Oti is the daily price movement over the trading open period

2.2.2 Basic estimator


For the sake of simplicity, let us start this paragraph by assuming that there is no opening jump (f = 0). The asset price St described by the process (2.1) is observed at a series of discrete dates {t0, ..., tn}. In general, this series is not necessarily regular. Let R_{t_i} be the realized return over the period [ti−1, ti[; then we obtain:

$$R_{t_i} = \ln S_{t_i} - \ln S_{t_{i-1}} = \int_{t_{i-1}}^{t_i} \left( \sigma_u \, dB_u + \left( \mu_u - \frac{1}{2}\sigma_u^2 \right) du \right)$$

In the following, we assume that the couple (µt, σt) is independent of the Brownian motion Bt driving the asset price evolution.


Estimator over a given period


In Appendix B.1, we show that the realized return R_{t_i} is related to the volatility as:

$$\mathbb{E}\left[R_{t_i}^2 \,\middle|\, \sigma, \mu\right] = (t_i - t_{i-1})\,\sigma_{t_i}^2 + (t_i - t_{i-1})^2 \left( \mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2 \right)^2$$

This quantity cannot be a good estimator of the volatility because its standard deviation is $\sqrt{2}\,(t_{i+1} - t_i)\,\sigma_{t_i}^2$, which is proportional to the estimated quantity. In order to reduce the estimation error, we focus on the estimation of the average volatility over the period $t_n - t_0$. The average volatility is defined as:

$$\bar{\sigma}^2 = \frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_u^2 \, du \qquad (2.2)$$

This quantity can be measured by using the canonical estimator defined as:

$$\hat{\sigma}^2 = \frac{1}{t_n - t_0} \sum_{i=1}^{n} R_{t_i}^2$$

The variance of this estimator is approximately $\operatorname{var}(\hat{\sigma}^2) \approx 2\bar{\sigma}^4/n$, so its standard deviation is proportional to $\sqrt{2}\,\bar{\sigma}^2/\sqrt{n}$. This means that the estimation error is small if n is large enough. Indeed, the variance of the average volatility estimator reads $\operatorname{var}(\hat{\sigma}) \approx \bar{\sigma}^2/(2n)$ and its standard deviation is approximately $\bar{\sigma}/\sqrt{2n}$.
2

Effect of the weight distribution


In general, we can define an estimator with a weight distribution wi such that:
\[ \hat{\sigma}^2 = \sum_{i=1}^{n} w_i \, R_{t_i}^2 \]
The expectation value of this estimator is then given by:
\[ \mathrm{E}\left[ \hat{\sigma}^2 \,\middle|\, \sigma, \mu \right] = \sum_{i=1}^{n} w_i \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \]
A simple example of this general definition is the estimator based on the annualized
returns R²ti /(ti − ti−1). In this case, our estimator becomes:
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \frac{R_{t_i}^2}{t_i - t_{i-1}} \]
for which the expectation value is:
\[ \mathrm{E}\left[ \hat{\sigma}^2 \,\middle|\, \sigma, \mu \right] = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{t_i - t_{i-1}} \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \tag{2.3} \]


We remark that if the time step (time increment) is constant, ti − ti−1 = T, then we
recover the canonical estimator. However, if the time step ti − ti−1 is not constant,
long-period returns are underweighted while short-period returns are overweighted.
We will see in the discussion on realized volatility below that the choice of the weight
distribution can help to improve the quality of the estimator. For example, we will
show that the IGARCH estimation leads to an exponential weight distribution, which
is more appropriate for estimating the realized volatility.

Close to close, open to close estimators


As discussed above, the volatility can be obtained by using a moving average on
discretely sampled data. The standard measurement is to employ the above result of
the canonical estimator for the closing prices (the so-called “close to close” estimator):
\[ \hat{\sigma}_{CC}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} \left( (o_{t_i} + c_{t_i}) - (\bar{o} + \bar{c}) \right)^2 \]
Here, T is the time period corresponding to 1 trading day. In the rest of the paper,
we use CC to denote the close to close estimator. We remark that there are two
differences in this formula compared to the one defined above. Firstly, we have
subtracted the mean value of the close-to-close return (ō + c̄) in order to eliminate the
drift effect:
\[ \bar{o} = \frac{1}{n} \sum_{i=1}^{n} o_{t_i}, \qquad \bar{c} = \frac{1}{n} \sum_{i=1}^{n} c_{t_i} \]
Secondly, the prefactor is now 1/((n − 1) T) instead of 1/(nT). Since we have subtracted
the mean value, the maximum likelihood procedure leads to the factor 1/((n − 1) T).
We can also define two other volatility estimators, namely the “open to close” estimator
(OC):
\[ \hat{\sigma}_{C}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} \left( c_{t_i} - \bar{c} \right)^2 \]
and the “close to open” estimator (CO):
\[ \hat{\sigma}_{O}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} \left( o_{t_i} - \bar{o} \right)^2 \]
We recall that oti is the opening jump for a given trading period and cti is the daily
movement of the asset price, such that the close to close return is equal to oti + cti.
We remark that the “close to close” estimator depends neither on the drift nor on the
closing interval f. In the absence of microstructure noise, this estimator is unbiased.
Hence, it is usually used as a benchmark to judge the efficiency of other estimators σ̂²,
which is defined as:
\[ \mathrm{eff}\left( \hat{\sigma}^2 \right) = \frac{\mathrm{var}\left( \hat{\sigma}_{CC}^2 \right)}{\mathrm{var}\left( \hat{\sigma}^2 \right)} \]
where var(σ̂²CC) = 2σ⁴/n. An estimator is of good quality when its efficiency is high,
eff(σ̂²) > 1.
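As an illustration, the following Python sketch computes the CC, OC and CO estimators from daily series of opening jumps o and daily movements c. The function names and the use of NumPy are our own assumptions for illustration, not part of the original text; with T = 1/252, the output is an annualized variance whose square root gives the volatility.

```python
import numpy as np

def close_to_close_var(o, c, T=1.0 / 252):
    # sigma^2_CC: variance of the demeaned close-to-close return (o + c),
    # with the 1/((n - 1) T) prefactor coming from the maximum likelihood argument
    r = np.asarray(o) + np.asarray(c)
    n = len(r)
    return np.sum((r - r.mean()) ** 2) / ((n - 1) * T)

def open_to_close_var(c, T=1.0 / 252):
    # sigma^2_C: based on the intraday (open-to-close) movements c
    c = np.asarray(c)
    n = len(c)
    return np.sum((c - c.mean()) ** 2) / ((n - 1) * T)

def close_to_open_var(o, T=1.0 / 252):
    # sigma^2_O: based on the overnight jumps o
    o = np.asarray(o)
    n = len(o)
    return np.sum((o - o.mean()) ** 2) / ((n - 1) * T)
```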


2.2.3 High-low estimators

We have seen that the daily deviation can be used to define an estimator of the
volatility. This comes from the assumption that the logarithm of the price follows a
Brownian motion. We know that the standard deviation of a diffusive process over a
time interval ∆t is proportional to σ√∆t, hence using the variance to estimate the
volatility is quite intuitive. Indeed, within a given time interval, if additional
information on the price movement is available, such as the highest and lowest values,
this range must also provide a good measure of the volatility. This idea was first
addressed by W. Feller in 1951. Later, Parkinson (1980) employed the first result of
Feller's work to provide the first “high-low” estimator (the so-called Parkinson
estimator). If one uses close prices to estimate the volatility, one can eliminate the
effect of the drift by subtracting the mean value of the daily variation. By contrast,
the use of high and low prices cannot eliminate the drift effect in such a simple way.
In addition, the high and low prices can only be observed during the open interval, so
this approach cannot eliminate the second effect, due to the opening jump. Moreover,
as demonstrated in the work of Parkinson (1980), this estimator gives tighter
confidence (a smaller variance) but it obviously underestimates the volatility because
of the discrete observation of the price: the maximum and minimum values observed
over a time interval ∆t are not the true extrema of the Brownian motion. They are
underestimated, so it is not surprising that the result depends strongly on the
frequency of the price quotation. In high-frequency markets, this third effect can be
negligible; we will nevertheless discuss it later. Because of the limitations of
Parkinson's estimator, another estimator also based on the work of Feller was
proposed by Kunitomo (1992). In order to eliminate the drift, he constructs a
Brownian bridge whose deviation is again related to the diffusion coefficient. In the
same line of thought, Rogers and Satchell (1991) propose another use of high and
low prices in order to obtain a drift-independent volatility estimator. In this section,
we review these techniques, which all remain affected by the opening jump.

The Parkinson estimator

Let us consider the random variable uti − dti (namely the range of the Brownian
motion over the period [ti , ti+1 [). The Parkinson estimator is defined by using
the following result (Feller, 1951):
\[ \mathrm{E}\left[ (u - d)^2 \right] = (4 \ln 2)\, \sigma^2 T \]
By inverting this formula, we obtain a natural estimator of volatility based on high
and low prices. Parkinson's volatility estimator is then defined as (Parkinson, 1980):
\[ \hat{\sigma}_{P}^2 = \frac{1}{nT} \sum_{i=1}^{n} \frac{1}{4 \ln 2} \left( u_{t_i} - d_{t_i} \right)^2 \]
In order to estimate the error of the estimator, we compute the variance of σ̂²P, which
is given by the following expression:
\[ \mathrm{var}\left( \hat{\sigma}_{P}^2 \right) = \left( \frac{9\,\zeta(3)}{16 (\ln 2)^2} - 1 \right) \frac{\sigma^4}{n} \]
Here, ζ(x) is the Riemann zeta function. In comparison with the benchmark “close to
close” estimator, we have an efficiency:
\[ \mathrm{eff}\left( \hat{\sigma}_{P}^2 \right) = \frac{32 (\ln 2)^2}{9\,\zeta(3) - 16 (\ln 2)^2} = 4.91 \]

The Garman-Klass estimator


Another idea, employing the additional information from the high and low values of
the price movement within the trading day in order to increase the estimator
efficiency, was introduced by Garman and Klass (1980). They construct a best
analytic scale-invariant estimator by proposing a quadratic form and imposing the
well-known invariance conditions of the Brownian motion on the set of variables
(u, d, c). By minimizing its variance, they obtain the optimal quadratic estimator,
which is given by the following property:
\[ \mathrm{E}\left[ 0.511\, (u - d)^2 - 0.019\, \big( c\,(u + d) - 2ud \big) - 0.383\, c^2 \right] = \sigma^2 T \]
The Garman-Klass estimator is then defined as:
\[ \hat{\sigma}_{GK}^2 = \frac{1}{nT} \sum_{i=1}^{n} \left[ 0.511\, (u_{t_i} - d_{t_i})^2 - 0.019\, \big( c_{t_i}(u_{t_i} + d_{t_i}) - 2 u_{t_i} d_{t_i} \big) - 0.383\, c_{t_i}^2 \right] \]
The minimal value of the variance corresponding to this quadratic estimator is
var(σ̂²GK) = 0.27 σ⁴/n and its efficiency is now eff(σ̂²GK) = 7.4.

The Kunitomo estimator


Let Xt be the logarithm of the price process, Xt = ln St ; Itô's lemma gives its
evolution:
\[ dX_t = \left( \mu_t - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t \]
If the drift term becomes relevant in the estimation of the volatility, one can eliminate
it by constructing a Brownian bridge on the period T as follows:
\[ W_t = X_t - \frac{t}{T}\, X_T \]
If the initial condition is normalized to X0 = 0, then by construction we always have
W0 = WT = 0. This construction automatically eliminates the drift term when its
daily variation is small, µti+1 − µti ≪ µti . We define the range of the Brownian
bridge Dti = Mti − mti , where Mti = maxt∈[ti ,ti+1 [ Wt and mti = mint∈[ti ,ti+1 [ Wt .
It has been shown that the expected squared range of the Brownian bridge is directly
proportional to the variance (Feller, 1951):
\[ \mathrm{E}\left[ D^2 \right] = \frac{\pi^2}{6}\, \sigma^2 T \tag{2.4} \]
Hence, Kunitomo's estimator is defined as follows:
\[ \hat{\sigma}_{K}^2 = \frac{1}{nT} \sum_{i=1}^{n} \frac{6}{\pi^2} \left( M_{t_i} - m_{t_i} \right)^2 \]
Higher moments of the Brownian bridge can also be computed analytically and are
given by formula 2.10 in Kunitomo (1992). In particular, the variance of Kunitomo's
estimator is equal to var(σ̂²K) = σ⁴/(5n), which implies an efficiency eff(σ̂²K) = 10.

The Rogers-Satchell estimator


Another way to eliminate the drift effect was proposed by Rogers and Satchell. They
consider the following property of the Brownian motion:
\[ \mathrm{E}\left[ u\,(u - c) + d\,(d - c) \right] = \sigma^2 T \]
This expectation value does not depend on the drift of the Brownian motion, hence it
provides a drift-independent estimator, which can be defined as:
\[ \hat{\sigma}_{RS}^2 = \frac{1}{nT} \sum_{i=1}^{n} \left[ u_{t_i} (u_{t_i} - c_{t_i}) + d_{t_i} (d_{t_i} - c_{t_i}) \right] \]
The variance of this estimator is given by var(σ̂²RS) = 0.331 σ⁴/n, which gives an
efficiency eff(σ̂²RS) = 6.
Like the other techniques based on the high-low range, this estimator underestimates
the volatility due to the fact that the maximum of a discretized Brownian motion is
smaller than the true value. Rogers and Satchell have also proposed a correction
scheme which can be generalized to the other techniques. Let M be the number of
quoted prices, so that h = T /M is the discretization step; the corrected estimator
taking the finite-step error into account is given by the root of the following equation:
\[ \hat{\sigma}_h^2 = 2 b h\, \hat{\sigma}_h^2 + 2 a\, (u - d)\, \sqrt{h}\, \hat{\sigma}_h + \hat{\sigma}_{RS}^2 \]
where a = \sqrt{2\pi}\left( 1/4 - (\sqrt{2} - 1)/6 \right) and b = (1 + 3\pi/4)/12.
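To make the high-low formulas above concrete, here is a minimal Python sketch (our own illustration, assuming daily arrays u, d, c of log high, low and close movements measured from the open) of the Parkinson, Garman-Klass and Rogers-Satchell estimators.

```python
import numpy as np

def parkinson_var(u, d, T=1.0 / 252):
    # sigma^2_P = 1/(nT) * sum (u - d)^2 / (4 ln 2)
    u, d = np.asarray(u), np.asarray(d)
    return np.sum((u - d) ** 2 / (4.0 * np.log(2.0))) / (len(u) * T)

def garman_klass_var(u, d, c, T=1.0 / 252):
    # sigma^2_GK = 1/(nT) * sum [0.511 (u-d)^2 - 0.019 (c(u+d) - 2ud) - 0.383 c^2]
    u, d, c = np.asarray(u), np.asarray(d), np.asarray(c)
    term = 0.511 * (u - d) ** 2 - 0.019 * (c * (u + d) - 2.0 * u * d) - 0.383 * c ** 2
    return np.sum(term) / (len(u) * T)

def rogers_satchell_var(u, d, c, T=1.0 / 252):
    # sigma^2_RS = 1/(nT) * sum [u(u-c) + d(d-c)]; independent of the drift
    u, d, c = np.asarray(u), np.asarray(d), np.asarray(c)
    return np.sum(u * (u - c) + d * (d - c)) / (len(u) * T)
```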

2.2.4 How to eliminate both drift and opening effects?


A common way to eliminate both effects, coming from the drift and from the opening
jump, is to combine the various available volatility estimators. The general scheme is
to form a linear combination of the opening estimator σ̂²O and the close estimator σ̂²C
or a high-low estimator σ̂²HL. The coefficients of this combination are determined by
minimizing the variance of the resulting estimator. Given the fraction of the closing
interval f, we can improve all the high-low estimators discussed above by introducing
the combination:
\[ \hat{\sigma}^2 = \alpha\, \frac{\hat{\sigma}_O^2}{f} + (1 - \alpha)\, \frac{\hat{\sigma}_{HL}^2}{1 - f} \]
Here, the trivial choice is α = f, which makes the estimator independent of the
opening jump. However, the optimal value of the coefficient is α = 0.17 for the
Parkinson and Kunitomo estimators whereas it is α = 0.12 for the Garman-Klass
estimator (Garman and Klass, 1980). This technique can eliminate the effect of the
opening jump for all estimators, but only the Kunitomo estimator can avoid both
effects.

Applying the same idea, Yang and Zhang (2000) have proposed another combination
which also eliminates both effects, as the Kunitomo estimator does. They choose the
following combination:
\[ \hat{\sigma}_{YZ}^2 = \alpha\, \frac{\hat{\sigma}_O^2}{f} + \frac{1 - \alpha}{1 - f} \left( \kappa\, \hat{\sigma}_C^2 + (1 - \kappa)\, \hat{\sigma}_{HL}^2 \right) \]
In their work, Yang and Zhang use σ̂²RS as the high-low estimator because it is a
drift-independent estimator. The coefficient α is chosen as α = f and κ is obtained
by minimizing the variance of the estimator. The minimization procedure gives the
optimal value of the parameter κ:
\[ \kappa_o = \frac{\beta - 1}{\beta + \frac{n+1}{n-1}} \]
where β = E[(u (u − c) + d (d − c))²]/(σ⁴ (1 − f)²). As the numerator is proportional
to (1 − f)², β is independent of f. Indeed, the value of β does not vary much (from
1.331 to 1.5) when the drift is changed. In practice, the value of β is chosen equal to 1.34.
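A possible implementation of this combination, reusing the estimator functions sketched in the previous code blocks, is given below (again an illustrative sketch of ours, with β = 1.34 and α = f as in the text):

```python
import numpy as np

def yang_zhang_var(o, u, d, c, f, T=1.0 / 252, beta=1.34):
    # sigma^2_YZ = alpha sigma^2_O / f + (1 - alpha)/(1 - f) * (kappa sigma^2_C + (1 - kappa) sigma^2_RS)
    # with alpha = f and kappa = (beta - 1) / (beta + (n + 1)/(n - 1))
    n = len(o)
    kappa = (beta - 1.0) / (beta + (n + 1.0) / (n - 1.0))
    s2_o = close_to_open_var(o, T)            # overnight (close-to-open) estimator
    s2_c = open_to_close_var(c, T)            # open-to-close estimator
    s2_rs = rogers_satchell_var(u, d, c, T)   # drift-independent high-low estimator
    alpha = f
    return alpha * s2_o / f + (1.0 - alpha) / (1.0 - f) * (kappa * s2_c + (1.0 - kappa) * s2_rs)
```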

2.2.5 Numerical simulations


Simulation with constant volatility
We test the various volatility estimators on a simulation of a geometric Brownian
motion with constant annualized drift µ = 30% and constant annualized volatility
σ = 15%. We run the simulation over N = 1000 trading days with M = 50 or 500
intraday observations in order to illustrate the effect of discrete price observation on
the family of high-low estimators.
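A minimal simulation sketch in Python (our own, with hypothetical parameter names), generating M intraday log-price points per day and extracting the open/high/low/close quantities used by the estimators:

```python
import numpy as np

def simulate_ohlc(n_days=1000, m_intraday=500, mu=0.30, sigma=0.15, f=0.0, seed=0):
    # Simulate daily (o, u, d, c) from a GBM with annualized drift mu and volatility sigma.
    # A fraction f of the daily variance is attributed to the (unobserved) overnight jump.
    rng = np.random.default_rng(seed)
    T = 1.0 / 252                     # one trading day in years
    o = np.empty(n_days); u = np.empty(n_days); d = np.empty(n_days); c = np.empty(n_days)
    for t in range(n_days):
        # overnight jump: Gaussian with variance f * sigma^2 * T
        o[t] = (mu - 0.5 * sigma ** 2) * f * T + sigma * np.sqrt(f * T) * rng.standard_normal()
        # intraday path over the open interval of length (1 - f) T, discretized on m_intraday steps
        dt = (1.0 - f) * T / m_intraday
        dx = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(m_intraday)
        path = np.concatenate(([0.0], np.cumsum(dx)))   # log-price relative to the open
        u[t], d[t], c[t] = path.max(), path.min(), path[-1]
    return o, u, d, c
```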

• Effect of the discretization

We first test the effect of the discretization on the various estimators. Here, we take
M = 50 or 500 intraday observations with µ = 0 and f = 0. In Figure 2.2, we present
the simulation results for M = 50 price quotations per trading day. All the high-low
estimators are weakly biased due to the discretization effect. They all underestimate
the volatility since the observed range is smaller than the true range of the Brownian
motion. We remark that the close-to-close estimator is unbiased but its variance is
large. The correction scheme proposed by Rogers and Satchell can eliminate the
discretization effect. When the number of observations is larger, the discretization
effect is negligible and all estimators are unbiased (see Figure 2.3).

Figure 2.2: Volatility estimators without drift and opening effects (M = 50)

[Plot of σ (%) over 1000 simulated trading days for the simulated σ and the CC, OC, P, K, GK, RS, RS−h and YZ estimators]

• Effect of the non-zero drift

We now consider the case with a non-zero annualized drift µ = 30%. Here, we take
M = 500 intraday observations. In Figure 2.4, we observe that the Parkinson estimator
and the Garman-Klass estimator depend strongly on the drift of the Brownian motion.
The Kunitomo estimator and the Rogers-Satchell estimator do not depend on the drift.

• Effect of the opening jump

To study the effect of the opening jump, we simulate data with f = 0.3. In Figure
2.5, we take M = 500 intraday observations with zero drift µ = 0. We observe that,
in the presence of the opening jump, all high-low estimators underestimate the
volatility except for the YZ estimator. By combining the open volatility estimator σ̂²O
with the other estimators, the effect of the opening jump can be completely eliminated
(see Figure 2.6).

30
Volatility Estimation for Trading Strategies

Figure 2.3: Volatility estimators without drift and opening effect (M = 500)

[Plot of σ (%) over 1000 simulated trading days for the simulated σ and the CC, OC, P, K, GK, RS, RS−h and YZ estimators]

Figure 2.4: Volatility estimators with µ = 30% and without opening effect (M = 500)

[Plot of σ (%) over 1000 simulated trading days for the simulated σ and the CC, OC, P, K, GK, RS, RS−h and YZ estimators]

31
Volatility Estimation for Trading Strategies

Figure 2.5: Volatility estimators with opening effect f = 0.3 and without drift
(M = 500)

[Plot of σ (%) over 1000 simulated trading days for the simulated σ and the CC, OC, P, K, GK, RS, RS−h and YZ estimators]

Figure 2.6: Volatility estimators with correction of the opening jump (f = 0.3)


Simulation with stochastic volatility


We now consider a simulation with stochastic volatility, described by the following model:
\[ \begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \xi\, \sigma_t^2 \, dB_t^{\sigma} \end{cases} \tag{2.5} \]
in which B^σ_t is a Brownian motion independent of the one driving the asset process.
We first estimate the volatility with all the proposed estimators and then verify the
quality of these estimators via a backtest using the voltarget strategy (the voltarget
strategy is described in detail in the Backtest section below). For the simulation of
the volatility, we take the same parameters as above with f = 0, µ = 0, N = 5000,
M = 500, ξ = 0.01 and σ0 = 0.4. In Figure 2.7, we present the results corresponding
to the different estimators. We remark that the group of high-low estimators gives a
better result for the volatility estimation.
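A sketch of this stochastic volatility simulation in Python (our own illustration, using the parameters quoted above; the discretization choices are ours):

```python
import numpy as np

def simulate_stochastic_vol(n_days=5000, m_intraday=500, xi=0.01, sigma0=0.4, mu=0.0, seed=0):
    # Simulate model (2.5): dS = mu S dt + sigma S dB, d(sigma^2) = xi sigma^2 dB_sigma
    rng = np.random.default_rng(seed)
    T = 1.0 / 252
    dt = T / m_intraday
    sigma2 = sigma0 ** 2
    log_s = 0.0
    log_close = np.empty(n_days)
    daily_sigma = np.empty(n_days)
    for t in range(n_days):
        daily_sigma[t] = np.sqrt(sigma2)
        for _ in range(m_intraday):
            sigma = np.sqrt(sigma2)
            log_s += (mu - 0.5 * sigma2) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
            # lognormal variance dynamics driven by an independent Brownian motion
            sigma2 *= np.exp(xi * np.sqrt(dt) * rng.standard_normal() - 0.5 * xi ** 2 * dt)
        log_close[t] = log_s
    return log_close, daily_sigma
```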

Figure 2.7: Volatility estimators on stochastic volatility simulation

[Plot of σ (%) over 5000 simulated trading days for the simulated σ and the CC, OC, P, K, GK, RS, RS−h and YZ estimators]

We can estimate the error committed by each estimator using the following formula:
\[ \varepsilon = \sum_{t=1}^{N} \left( \hat{\sigma}_t - \sigma_t \right)^2 \]

The errors obtained for the various estimators are summarized in Table 2.1 below.
We now apply these volatility estimates to run the voltarget strategies. The result
of this test is presented in Figure 2.8.


Table 2.1: Estimation error for various estimators

Estimator              σ̂²CC    σ̂²P     σ̂²K     σ̂²GK    σ̂²RS    σ̂²YZ
Σ_{t=1}^N (σ̂ − σ)²     0.135   0.072   0.063   0.08    0.076   0.065

In order to control the quality of the voltarget strategy, we compute the volatility of
the voltarget strategy obtained with each estimator. We remark that the volatility of
the voltarget strategies is computed with the close-to-close estimator, using the same
averaging window of 3 months (or 65 trading days). The result is reported in Figure
2.9. As shown in the figure, all estimators give more or less the same results. If we
compute the error committed by these estimators, we obtain εCC = 0.9491,
εP = 1.0331, εK = 0.9491, εGK = 1.2344, εRS = 1.2703, εYZ = 1.1383. This result may
come from the fact that we have used the close-to-close estimator to calculate the
volatility of all the voltarget strategies.

Figure 2.8: Test of voltarget strategy with stochastic volatility simulation

[Plot of the voltarget strategy wealth over 5000 simulated days for the benchmark and the CC, OC, P, GK, RS and YZ estimators]

Hence, we consider another check of the estimation quality. We compute the realized
return of the voltarget strategies:
\[ R_V(t_i) = \ln V_{t_i} - \ln V_{t_{i-1}} \]
where Vti is the wealth of the voltarget portfolio. We expect this quantity to follow a
Gaussian distribution with volatility σ* = 15%. Figure 2.10 shows the probability
density function (Pdf) of the realized returns corresponding to all the considered
estimators.


Figure 2.9: Test of voltarget strategy with stochastic volatility simulation

[Plot of the 3-month close-to-close volatility σ (%) of the voltarget strategies obtained with the CC, OC, P, K, GK, RS and YZ estimators]

In order to obtain a more visible result, we compute the difference between the
cumulative distribution function (Cdf) of each estimator and the expected Cdf (see
Figure 2.11). Both results confirm that the Parkinson and the Kunitomo estimators
improve the quality of the volatility estimation.

2.2.6 Backtest

Volatility estimation of the S&P 500 index

We now employ the estimators discussed above on the S&P 500 index. Here, we do
not have the full tick-by-tick intraday data, hence Kunitomo's estimator and the
Rogers-Satchell correction cannot be applied.
We remark that the effect of the drift is almost negligible, which is confirmed by the
Parkinson and Garman-Klass estimators. The instantaneous closing fraction is
estimated simply by:
\[ f_t = \left( 1 + \left( \frac{\hat{\sigma}_C}{\hat{\sigma}_O} \right)^2 \right)^{-1} \]
We then employ the exponential-average technique to obtain a filtered version of this
quantity. We obtain the average value of the closing interval over the considered data,
f̄ = 0.015 for the S&P 500 and f̄ = 0.21 for BBVA SQ Equity. In the following, we
use different estimators in order to extract the signal ft. The trivial one uses ft itself
as the prediction of the opening jump, which we denote fˆt; we then construct the
usual filters such as the moving average fˆma, the exponential moving average fˆexp
and the cumulated average fˆc. In Figure 2.15, we show the results corresponding to
the different filters of f on the BHI UN Equity data.


Figure 2.10: Comparison between different probability density functions

[Plot of the Pdf of the realized returns RV for the expected Pdf and the CC, OC, P, K, GK, RS and YZ estimators]

Figure 2.11: Comparison between the different cumulative distribution functions

[Plot of the Cdf difference ∆Cdf versus RV for the CC, OC, P, K, GK, RS and YZ estimators]


Figure 2.12: Volatility estimators on the S&P 500 index
[Plot of σ (%) from 01/2001 to 01/2011 for the CO, CC, OC, P, GK, RS and YZ estimators]

Figure 2.13: Volatility estimators on BHI UN Equity
[Plot of σ (%) from 01/2001 to 01/2011 for the CO, CC, OC, P, GK, RS and YZ estimators]


Figure 2.14: Estimation of the closing interval for S&P 500 index

[Plot of the estimated closing fraction f for the S&P 500 index: realized closing ratio, moving average, exponential average, cumulated average and overall average]

Figure 2.15: Estimation of the closing interval for BHI UN Equity

[Plot of the estimated closing fraction f for BHI UN Equity: realized closing ratio, moving average, exponential average, cumulated average and overall average]


Figure 2.13 shows that the family of high-low estimators gives a better result than the
classical close-to-close estimator. In order to check the quality of these estimators for
the prediction of the volatility, we compute the value of the likelihood function
corresponding to each estimator. Assuming that the observed signal follows a Gaussian
distribution, the log-likelihood function is defined as:
\[ l(\sigma) = -\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{i=1}^{n} \ln \sigma_i^2 - \frac{1}{2} \sum_{i=1}^{n} \left( \frac{R_{i+1}}{\sigma_i} \right)^2 \]
where R is the future realized return. In Figures 2.16 and 2.17, we present the value of
the likelihood function for the different estimators. This function reaches its maximal
value for the Rogers-Satchell estimator.
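A sketch of this likelihood criterion in Python (illustrative only; sigma_pred[i] is assumed to be the volatility forecast available before observing the return r[i + 1], both expressed on the same time scale):

```python
import numpy as np

def gaussian_log_likelihood(r, sigma_pred):
    # l(sigma) = -n/2 ln(2 pi) - 1/2 sum ln(sigma_i^2) - 1/2 sum (R_{i+1} / sigma_i)^2
    r = np.asarray(r, dtype=float)
    sigma_pred = np.asarray(sigma_pred, dtype=float)
    future_r = r[1:]                 # R_{i+1}
    sigma_i = sigma_pred[:-1]        # forecast made at time i
    n = len(future_r)
    return (-0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(sigma_i ** 2))
            - 0.5 * np.sum((future_r / sigma_i) ** 2))
```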

Figure 2.16: Likelihood function for various estimators on the S&P 500 index
[Bar chart of the likelihood values for the CC, OC, P, GK, RS and YZ estimators]

Backtest on voltarget strategy


We now backtest the efficiency of the various volatility estimators with a vol-target
strategy on the S&P 500 index and on an individual stock. Within the vol-target
strategy, the exposure to the risky asset is determined by the following expression:
\[ \alpha_t = \frac{\sigma^{\star}}{\hat{\sigma}_t} \]
where σ* is the expected volatility of the strategy and σ̂t is the prediction of the
volatility given by the estimators above. In the backtest, we take the annualized
volatility σ* = 15% with historical data from 01/01/2001 to 31/12/2011 (a sketch of
the corresponding backtest loop is given after the two cases below). We present the
results for two cases:


Figure 2.17: Likelihood function for various estimators on BHI UN Equity
[Bar chart of the likelihood values for the CC, OC, P, GK, RS and YZ estimators]

• Backtest on the S&P 500 index with a moving-average window of 1 month (n = 21)
of historical data. We remark that in this case the volatility of the index is small, so
the error on the volatility estimation has less effect. However, the high-low estimators
suffer from the discretization effect and therefore underestimate the volatility. For the
index, this effect is more important, so the close-to-close estimator gives the best
performance.

• Backtest on a single asset with a moving-average window of 1 month (n = 21) of
historical data. In the case of a particular asset such as BBVA SQ Equity, the
volatility is high, hence the error due to the efficiency of the volatility estimators
matters more. The high-low estimators now give better results than the classical one.
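For illustration, a minimal vol-target backtest loop in Python (our own sketch, not the exact backtesting engine used in this study; prices is a daily price series and sigma_hat the corresponding volatility forecasts):

```python
import numpy as np

def voltarget_backtest(prices, sigma_hat, sigma_star=0.15):
    # Exposure alpha_t = sigma_star / sigma_hat_t, applied to the next day's return.
    prices = np.asarray(prices, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    returns = np.diff(np.log(prices))          # daily log-returns
    alpha = sigma_star / sigma_hat[:-1]        # exposure decided one day ahead
    wealth = np.cumprod(1.0 + alpha * (np.exp(returns) - 1.0))
    return np.concatenate(([1.0], wealth))     # portfolio value, starting at 1
```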

In order to illustrate the efficiency of the range-based estimators, we build a ranking
between the high-low estimators and the benchmark close-to-close estimator. We
apply the voltarget strategy with the close-to-close estimator σ̂²CC and with a high-low
estimator σ̂²HL. We then compare the Sharpe ratios obtained with these two
estimators and compute the frequency with which the high-low estimator gives the
better performance over the ensemble of stocks. The results over the S&P 500 index
and its first 100 constituents are summarized in Tables 2.2 and 2.3.


Figure 2.18: Backtest of voltarget strategy on S&P 500 index

[Plot of the voltarget strategy backtest from 01/2001 to 01/2011 for the S&P 500 index and the CC, OC, P, GK, RS and YZ estimators]

Figure 2.19: Backtest of voltarget strategy on BHI UN Equity

[Plot of the voltarget strategy backtest from 01/2001 to 01/2011 for the benchmark and the CC, OC, P, GK, RS and YZ estimators]


Table 2.2: Performance of σ̂²HL versus σ̂²CC for different averaging windows

Window     σ̂²P     σ̂²GK    σ̂²RS    σ̂²YZ
6 months   56.2%   52.8%   52.8%   57.3%
3 months   52.8%   49.4%   51.7%   53.9%
2 months   60.7%   60.7%   60.7%   56.2%
1 month    65.2%   64.0%   64.0%   64.0%

Table 2.3: Performance of σ̂²HL versus σ̂²CC for different filters of f

Filter   σ̂²P     σ̂²GK    σ̂²RS    σ̂²YZ
fˆc      65.2%   64.0%   64.0%   64.0%
fˆma     64.0%   61.8%   61.8%   64.0%
fˆexp    64.0%   61.8%   60.7%   64.0%
fˆt      64.0%   61.8%   60.7%   64.0%

2.3 Estimation of realized volatility


The common way to estimate the realized volatility is to estimate the expectation
value of the variance over an observation window and then to compute the
corresponding volatility. However, in doing so we encounter a dilemma: a long
historical window helps to decrease the estimation error, as discussed in the last
paragraph, whereas a short historical window gives an estimate closer to the present
volatility.
In order to overcome this dilemma, we need an idea of the dynamics of the variance
σt² that we would like to measure. Combining this knowledge of the dynamics of σt²
with the error committed on a long historical window, we can find an optimal window
for the volatility estimator. We assume that the variance follows the simplified
dynamics used in the last numerical simulation:
\[ \begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \xi\, \sigma_t^2 \, dB_t^{\sigma} \end{cases} \]
in which B^σ_t is a Brownian motion independent of the one driving the asset process.

2.3.1 Moving-average estimator


In this section, we show how the optimal window of the moving-average estimator is
obtained via a simple example. Let us consider the canonical estimator:
\[ \hat{\sigma}^2 = \frac{1}{nT} \sum_{i=1}^{n} R_{t_i}^2 \]


Here, the time increment is chosen to be constant, ti − ti−1 = T, so the variance of
this estimator at time tn is:
\[ \mathrm{var}\left( \hat{\sigma}^2 \right) \approx \frac{2 \sigma_{t_n}^4\, T}{t_n - t_0} = \frac{2 \sigma_{t_n}^4}{n} \]
On the other hand, σt² is now itself a stochastic process, hence its conditional variance
given σ²tn gives us the error due to the use of historical observations. We rewrite:
\[ \frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt = \sigma_{t_n}^2 - \frac{1}{t_n - t_0} \int_{t_0}^{t_n} (t - t_0)\, \sigma_t^2\, \xi \, dB_t^{\sigma} \]
so the error due to the stochastic volatility is given by:
\[ \mathrm{var}\left( \frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt \,\middle|\, \sigma_{t_n}^2 \right) \approx \frac{t_n - t_0}{3}\, \sigma_{t_n}^4\, \xi^2 = \frac{nT\, \sigma_{t_n}^4\, \xi^2}{3} \]
The total error of the canonical estimator is simply the sum of these two errors, because
the two Brownian motions are assumed to be independent. We define the total
estimation error as:
\[ e\left( \hat{\sigma}^2 \right) = \frac{2 \sigma_{t_n}^4}{n} + \frac{nT\, \sigma_{t_n}^4\, \xi^2}{3} \]
In order to obtain the optimal window for the volatility estimation, we minimize the
error function e(σ̂²) with respect to nT, which leads to the following equation:
\[ \frac{\sigma_{t_n}^4\, \xi^2}{3} - \frac{2 \sigma_{t_n}^4}{n^2 T} = 0 \]
This equation has a very simple solution, nT = √(6T)/ξ, for which the optimal error is
e(σ̂²opt) ≈ 2√(2T/3) σ⁴tn ξ. The major difficulty of this estimator is the calibration of
the parameter ξ, which is not trivial because σt² is an unobservable process. Different
techniques can be considered, such as the maximum likelihood approach discussed
later.
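A small helper illustrating this trade-off numerically (an illustrative sketch of ours; xi is assumed to be expressed in annualized units and T in years, and the efficiency argument anticipates the range-based extension of Equation (2.8) below):

```python
import numpy as np

def optimal_window_days(xi, T=1.0 / 252, efficiency=1.0):
    # nT = sqrt(6 T / (eff * xi^2))  =>  n = sqrt(6 / (eff * xi^2 * T)) trading days
    # efficiency = 1 recovers the canonical (close-to-close) case.
    n = np.sqrt(6.0 / (efficiency * xi ** 2 * T))
    return int(round(n))
```

For instance, with ξ = 2 (annualized) this gives an averaging window of roughly 19 trading days.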

2.3.2 IGARCH estimator


We now discuss another approach for estimating the realized volatility, based on the
IGARCH model. The detailed theoretical derivation of the method is given in Drost
and Nijman (1993) and Drost and Werker (1999). It consists of a volatility estimator
of the form:
\[ \hat{\sigma}_t^2 = \beta\, \hat{\sigma}_{t-T}^2 + \frac{1 - \beta}{T}\, R_t^2 \]
where T is a constant estimation increment. By iterating the recurrence relation above,
we obtain the IGARCH variance estimator as a function of the returns observed in the
past:
\[ \hat{\sigma}_t^2 = \frac{1 - \beta}{T} \sum_{i=1}^{n} \beta^i R_{t-iT}^2 + \beta^n\, \hat{\sigma}_{t-nT}^2 \tag{2.6} \]


We remark that the contribution of the last term tends to 0 when n tends to infinity.
This estimator again has the form of a weighted average, so an approach similar to the
one used for the canonical estimator is applicable. Assuming that the volatility follows
the lognormal dynamics described by Equation 2.3, the optimal value of β is given by:
\[ \beta^{\star} = \frac{\xi \sqrt{8T - \xi^2 T^2} - 4}{\xi^2 T - 4} \tag{2.7} \]
We encounter here the same question as in the canonical case, namely how to calibrate
the parameter ξ of the lognormal dynamics. In practice, we proceed the other way
around: we first seek the optimal value β* of the IGARCH estimator and then use the
inverse relation of Equation 2.7 to determine the value of ξ:
\[ \xi = \sqrt{ \frac{4}{T}\, \frac{(1 - \beta^{\star})^2}{1 + \beta^{\star 2}} } \]

Remark 1 Finally, as stressed at the beginning of this discussion, we would like to
point out that the IGARCH estimator can be considered as an exponentially weighted
average. We start with an IGARCH estimator with a constant time increment. The
expectation value of this estimator is:
\[ \mathrm{E}\left[ \hat{\sigma}_t^2 \,\middle|\, \sigma \right] = \mathrm{E}\left[ \frac{1-\beta}{T} \sum_{i=1}^{+\infty} \beta^i R_{t-iT}^2 \,\middle|\, \sigma \right] = \frac{1-\beta}{T} \sum_{i=1}^{+\infty} \beta^i \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du \]
In this form, we see that the IGARCH estimator is a weighted average of the variance
σt² with an exponential weight distribution. Setting λ = − ln β / T, the annualized
estimator of the volatility can be written as:
\[ \mathrm{E}\left[ \hat{\sigma}_t^2 \,\middle|\, \sigma \right] = \frac{ \sum_{i=1}^{+\infty} e^{-iT\lambda} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du }{ \sum_{i=1}^{+\infty} T\, e^{-iT\lambda} } \]
This expression admits a continuous limit when T → 0.
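A possible implementation of this exponentially weighted (IGARCH-type) recursion in Python (an illustrative sketch of ours; r2 is the series of squared daily returns and T the daily time step in years):

```python
import numpy as np

def igarch_variance(r2, beta, T=1.0 / 252, sigma2_init=None):
    # Recursion: sigma2_t = beta * sigma2_{t-T} + (1 - beta) / T * R_t^2
    r2 = np.asarray(r2, dtype=float)
    sigma2 = np.empty_like(r2)
    sigma2_prev = r2[0] / T if sigma2_init is None else sigma2_init
    for t, x in enumerate(r2):
        sigma2_prev = beta * sigma2_prev + (1.0 - beta) / T * x
        sigma2[t] = sigma2_prev
    return sigma2

def xi_from_beta(beta, T=1.0 / 252):
    # Inverse of equation (2.7): xi = sqrt(4 (1 - beta)^2 / (T (1 + beta^2)))
    return np.sqrt(4.0 * (1.0 - beta) ** 2 / (T * (1.0 + beta ** 2)))
```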


2.3.3 Extension to range-based estimators


The determination of the optimal window in the last discussion can also be generalized
to the case of range-based estimators. The main idea is to obtain the trade-off between
the estimator error (the variance of the estimator) and the volatility dynamics
described by the model (2.3). The equation determining the total error of the
estimator is:
\[ e\left( \hat{\sigma}^2 \right) = \mathrm{var}\left( \hat{\sigma}^2 \right) + \frac{nT}{3}\, \sigma_{t_n}^4\, \xi^2 \]
Here, we recall that the first term in this expression is the estimator error coming from
the discrete sum whereas the second term is the error due to the stochastic volatility.
The first term is already given by the study of the various estimators in the last
section, while the second term depends on the chosen volatility dynamics. Using the
notation of the estimator efficiency, we rewrite the above expression as:
\[ e\left( \hat{\sigma}^2 \right) = \frac{1}{\mathrm{eff}\left( \hat{\sigma}^2 \right)}\, \frac{2 \sigma_{t_n}^4}{n} + \frac{nT}{3}\, \sigma_{t_n}^4\, \xi^2 \]
The minimization of the total error proceeds exactly as in the last example on the
canonical estimator, and we obtain the following optimal averaging window:
\[ nT = \sqrt{ \frac{6T}{\mathrm{eff}\left( \hat{\sigma}^2 \right)\, \xi^2} } \tag{2.8} \]
The IGARCH estimator can also be applied to the various types of high-low
estimators: the extension consists of performing an exponential moving average
instead of the simple average. The parameter β of the exponential moving average is
again determined by the maximum likelihood method, as shown in the discussion
below.

2.3.4 Calibration procedure of the estimators of realized volatility


As discussed above, the estimators of realized volatility depend on the choice of the
underlying volatility dynamics. In order to obtain the best estimation of the realized
volatility, we must estimate the parameter which characterizes this dynamics. Two
possible approaches to obtain the optimal value of these estimators are:

• using the least squares approach, which consists in minimizing the following
objective function:
\[ \sum_{i=1}^{n} \left( R_{t_i + T}^2 - T\, \hat{\sigma}_{t_i}^2 \right)^2 \]

• or using the maximum likelihood approach, which consists in maximizing the
log-likelihood objective function:
\[ -\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{i=0}^{n} \ln \left( T\, \hat{\sigma}_{t_i}^2 \right) - \sum_{i=0}^{n} \frac{R_{t_i + T}^2}{2T\, \hat{\sigma}_{t_i}^2} \]


We remark here that the moving-average estimator depends only on the averaging
window whereas the IGARCH estimator depends only on the parameter β. In general,
there is no way to compare these two estimators if we do not use a specific dynamics.
With a specific dynamics, the optimal values of both parameters are derived from the
optimal value of ξ, which offers a direct comparison of the quality of these two
estimators.

Example of realized volatility


We illustrate here how the realized volatility is computed by the two methods
discussed above. In order to show how the optimal value of the averaging window nT
or of β* is calibrated, we plot the likelihood functions of these two estimators for one
value of the volatility at a given date. In Figure 2.20, we present the log-likelihood
functions for different values of ξ. The maximum of the function l(ξ) gives us the
optimal value ξ*, which is then used to evaluate the volatility for the two methods.
We remark that the IGARCH estimator is better suited to locating the global
maximum because its log-likelihood is a concave function. For the moving-average
method, the log-likelihood function is not smooth and presents a complicated
structure with local maxima, which makes the optimization procedure less efficient.

Figure 2.20: Comparison between IGARCH estimator and CC estimator

[Plot of the log-likelihood l(ξ) versus ξ for the optimal CC (moving-average) estimator and the IGARCH estimator]

We now test the implementation of the IGARCH estimator for the various high-low
estimators. As we have shown that the IGARCH estimator is equivalent to an
exponential moving average, the implementation for high-low estimators can be set up
in the same way as for the close-to-close estimator. In order to determine the optimal
parameter β*, we run an optimization scheme on the log-likelihood function. In Figure
2.21, we compare the log-likelihood functions of the different estimators as a function
of the parameter β. The optimal parameter β* of each estimator corresponds to the
maximum of its log-likelihood function.

Figure 2.21: Likelihood function of the high-low estimators versus the filter parameter β
[Plot of l(β) versus β for the CC, OC, P, GK, RS and YZ estimators]

In order to get a clear idea of the moving-average window size corresponding to the
optimal parameter β*, we use formula (2.7) to perform the conversion. The result is
reported in Figure 2.22 below.

Backtest on the voltarget strategy


We take historical data of the S&P 500 index over the period from 01/2001 to 12/2011
and choose the averaging window of the close-to-close estimator as n = 25. In Figure
2.23, we show the different estimates of the realized volatility.
In order to test the efficiency of these realized estimators (moving-average and
IGARCH), we first evaluate the likelihood function for the close-to-close estimator
and for the realized estimators, and then apply these estimators to the voltarget
strategy as in the last section. In Figure 2.25, we present the value of the likelihood
function over the period from 01/2001 to 12/2010 for three estimators: CC, CC
optimal (moving-average) and IGARCH. The estimator corresponding to the highest
value of the likelihood function is the one that gives the best prediction of the
volatility.


Figure 2.22: Likelihood function of the high-low estimators versus the effective moving window
[Plot of l(n) versus the window size n for the CC, OC, P, GK, RS and YZ estimators]

Figure 2.23: IGARCH estimator versus moving-average estimator for close-to-close prices
[Plot of σ (%) from 01/2001 to 01/2011 for the CC, CC optimal and IGARCH estimators]


Figure 2.24: Comparison between different IGARCH estimators for high-low prices

[Plot of σ (%) from 01/2001 to 01/2011 for the CC, CO, P, GK, RS and YZ IGARCH estimators]

Figure 2.25: Daily estimation of the likelihood function for various close-to-close estimators
[Plot of l(σ̂) from 01/2001 to 01/2011 for the CC, CC optimal and CC IGARCH estimators]


Figure 2.26: Daily estimation of the likelihood function for various high-low estimators
[Plot of l(σ̂) from 01/2001 to 01/2011 for the CC, OC, P, GK, RS and YZ estimators]

In Figure 2.27, the result of the backtest of the voltarget strategy is presented for the
three considered estimators. The estimators with a dynamical choice of the averaging
parameter always give better results than the simple close-to-close estimator with a
fixed averaging window n = 25. We next backtest the IGARCH estimator applied to
the high-low price data; the comparison with the IGARCH estimator applied to
close-to-close data is shown in Figure 2.28. We observe that the IGARCH estimator
for close-to-close prices is one of the estimators which produce the best backtest.

2.4 High-frequency volatility estimators


We have discussed in the previous sections how to measure the daily volatility based
on the range of the observed prices. If more information is available in the trading
data, such as the full set of real-time quotations, can one estimate the volatility more
accurately? As the trading frequency increases, we expect the precision of the
estimator to improve as well. However, when the trading frequency exceeds a certain
limit, a new phenomenon coming from the non-equilibrium of the market emerges and
spoils the precision. This limit defines the optimal frequency for the classical
estimator; in the literature, it is more or less agreed to be around one trade every 5
minutes. This phenomenon is called microstructure noise and is characterized by the
bid-ask spread and transaction effects. In this section, we summarize and test some
recent proposals which attempt to eliminate the microstructure noise.


Figure 2.27: Backtest for close-to-close estimator and realized estimators

[Plot of the voltarget strategy backtest on the S&P 500 from 01/2001 to 01/2011 for the CC, CC optimal and CC IGARCH estimators]

Figure 2.28: Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
[Plot of the voltarget strategy backtest from 01/2001 to 01/2011 for the S&P 500 and the CC, OC, P, GK, RS and YZ IGARCH estimators]


2.4.1 Microstructure effect


It has been demonstrated in the financial literature that the realized return estimator
is not robust when the sampling frequency is too high. Two possible explanations of
this effect are the following. From a probabilistic point of view, this phenomenon
comes from the fact that the cumulated return (the logarithm of the price) is not a
semimartingale, as we assumed in the last section; this only emerges at short time
scales, when the trading frequency is high enough. From a financial point of view, this
effect is explained by the existence of so-called market microstructure noise, which
comes from the existence of the bid-ask spread. We now discuss the simplest model,
which includes the microstructure noise as a noise independent of the underlying
Brownian motion. We assume that the true cumulated return is an unobservable
process and follows a Brownian motion:
\[ dX_t = \left( \mu_t - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t \]
The observed signal Yt is the cumulated return perturbed by the microstructure noise εt:
\[ Y_t = X_t + \epsilon_t \]
For the sake of simplicity, we use the following assumptions:

(i) εti is iid with E[εti ] = 0 and E[ε²ti ] = E[ε²]

(ii) εt ⊥⊥ Bt

From these assumptions, we see immediately that the volatility estimator based on the
historical data Yti is biased:
\[ \mathrm{var}(Y) = \mathrm{var}(X) + \mathrm{E}\left[ \epsilon^2 \right] \]
The first term var(X) scales with t (the estimation horizon) whereas E[ε²] is constant,
so this estimator can be considered as unbiased if the time horizon is large enough
(t > E[ε²]/σ²). At high frequency, the second term is not negligible and a better
estimator must be able to eliminate this term.

2.4.2 Two time-scale volatility estimator


The use of different time scales to extract the true volatility of the hidden (noise-free)
price process was independently proposed by Zhang et al. (2005) and Bandi et al.
(2004). In this paragraph, we follow the approach of the first reference to define the
intraday volatility estimator. We prefer to discuss the main idea of this method and
its practical implementation rather than all the details of the stochastic calculus
concerning the expectation and the variance of the realized return³.
³ Details of the derivation of this technique can be found in Zhang et al. (2005).


Definitions and notations


In order to fix the notations, let us consider a time period [0, T ] which is divided into
M − 1 intervals (M can be understood as the frequency). The quadratic variation of
the Brownian motion over this period is denoted:
\[ \langle X, X \rangle_T = \int_0^T \sigma_t^2 \, dt \]
For the discretized version of the quadratic variation, we employ the [., .] notation:
\[ [X, X]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( X_{t_{i+1}} - X_{t_i} \right)^2 \]
The usual estimator of the realized variance over the interval [0, T ] is then given by:
\[ [Y, Y]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( Y_{t_{i+1}} - Y_{t_i} \right)^2 \]
We remark that the number of points in the interval [0, T ] can be changed; in fact, the
expectation value of the quadratic variation should not depend on the distribution of
points in this interval. Let us define the ensemble of points in one period as a grid G:
\[ G = \{ t_0, \ldots, t_M \} \]
A subgrid H is then defined as:
\[ H = \{ t_{k_1}, \ldots, t_{k_m} \} \]
where (tkj ), j = 1, . . . , m, is a subsequence of (ti ), i = 1, . . . , M. The number of
increments is denoted by:
\[ |H| = \mathrm{card}(H) - 1 \]
With these notations, the quadratic variation over a subgrid H reads:
\[ [Y, Y]_T^{H} = \sum_{t_{k_i}, t_{k_{i+1}} \in H} \left( Y_{t_{k_{i+1}}} - Y_{t_{k_i}} \right)^2 \]

The realized volatility estimator over the full grid


If we compute the quadratic variation over the full grid G, that is at the highest
frequency, it is not surprising that it suffers the most from the microstructure noise:
\[ [Y, Y]_T^{G} = [X, X]_T^{G} + 2\, [X, \epsilon]_T^{G} + [\epsilon, \epsilon]_T^{G} \]
Under the hypothesis on the microstructure noise, the conditional expectation of this
estimator is equal to:
\[ \mathrm{E}\left[ [Y, Y]_T^{G} \,\middle|\, X \right] = [X, X]_T^{G} + 2M\, \mathrm{E}\left[ \epsilon^2 \right] \]
and its conditional variance is:
\[ \mathrm{var}\left( [Y, Y]_T^{G} \,\middle|\, X \right) = 4M\, \mathrm{E}\left[ \epsilon^4 \right] + 8\, [X, X]_T^{G}\, \mathrm{E}\left[ \epsilon^2 \right] - 2\, \mathrm{var}\left( \epsilon^2 \right) + O(M^{-1/2}) \]
In these two expressions, the terms are arranged order by order. In the limit M → ∞,
we obtain the usual result of the central limit theorem:
\[ M^{-1/2} \left( [Y, Y]_T^{G} - 2M\, \mathrm{E}\left[ \epsilon^2 \right] \right) \xrightarrow{\;\mathcal{L}\;} 2 \left( \mathrm{E}\left[ \epsilon^4 \right] \right)^{1/2} \mathcal{N}(0, 1) \]
Hence, as M increases, [Y, Y ]^G_T becomes a good estimator of the microstructure
noise and we define:
\[ \widehat{\mathrm{E}\left[ \epsilon^2 \right]} = \frac{1}{2M}\, [Y, Y]_T^{G} \]
The central limit theorem for this estimator states:
\[ M^{1/2} \left( \widehat{\mathrm{E}\left[ \epsilon^2 \right]} - \mathrm{E}\left[ \epsilon^2 \right] \right) \xrightarrow{\;\mathcal{L}\;} \left( \mathrm{E}\left[ \epsilon^4 \right] \right)^{1/2} \mathcal{N}(0, 1) \quad \text{as } M \to \infty \]

The realized volatility estimator over a subgrid

As mentioned in the last discussion, increasing the frequency spoils the estimation of
the volatility due to the presence of the microstructure noise. A naive solution is to
reduce the number of points in the grid, that is to consider only a subgrid, and then to
average over a number of subgrids. Let us consider a subgrid H with |H| = m − 1; the
same result as for the full grid is obtained by replacing M by m:
\[ \mathrm{E}\left[ [Y, Y]_T^{H} \,\middle|\, X \right] = [X, X]_T^{H} + 2m\, \mathrm{E}\left[ \epsilon^2 \right] \]
Let us now consider a sequence of subgrids H^(k), k = 1, . . . , K, which satisfies
G = ∪_{k=1}^K H^(k) and H^(k) ∩ H^(l) = ∅ for k ≠ l. By averaging over these K
subgrids, we obtain:
\[ [Y, Y]_T^{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} [Y, Y]_T^{H^{(k)}} \]
We define the average length of the subgrids \bar{m} = (1/K) \sum_{k=1}^{K} m_k; the
final expression is then:
\[ \mathrm{E}\left[ [Y, Y]_T^{\mathrm{avg}} \,\middle|\, X \right] = [X, X]_T^{\mathrm{avg}} + 2\bar{m}\, \mathrm{E}\left[ \epsilon^2 \right] \]
This estimator of the volatility is still biased and its precision depends strongly on the
choice of the subgrid length and the number of subgrids. In their paper, Zhang et al.
demonstrate that there exists an optimal value K* for which the best performance of
the estimator is reached.

54
Volatility Estimation for Trading Strategies

Two time-scale estimator


As the full-grid estimator and the subgrid-averaging estimator both contain the same
microstructure noise component up to a factor, we can combine them to build a new
estimator in which the microstructure noise is completely eliminated. Let us consider
the following estimator:
\[ \hat{\sigma}_{ts}^2 = \left( 1 - \frac{\bar{m}}{M} \right)^{-1} \left( [Y, Y]_T^{\mathrm{avg}} - \frac{\bar{m}}{M}\, [Y, Y]_T^{G} \right) \]
This estimator is now unbiased, with a precision determined by the choice of K and
\bar{m}. In the theoretical framework, the optimal value of K is given as a function
of the noise variance and of the fourth moment of the volatility. In practice, we scan
over the number of subgrids (of size m ∝ M/K) in order to look for the optimal
estimator.
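A minimal Python sketch of this two time-scale construction (our own illustration; y is the vector of observed log-prices on the full intraday grid and K the number of subgrids):

```python
import numpy as np

def realized_var(y):
    # [Y, Y]_T over the grid defined by the samples in y
    y = np.asarray(y, dtype=float)
    return np.sum(np.diff(y) ** 2)

def two_scale_var(y, K):
    # sigma^2_ts = (1 - m_bar / M)^{-1} ([Y, Y]^avg - m_bar / M * [Y, Y]^G)
    y = np.asarray(y, dtype=float)
    M = len(y) - 1                                       # number of increments on the full grid
    rv_full = realized_var(y)
    rv_sub = [realized_var(y[k::K]) for k in range(K)]   # K regular subgrids
    rv_avg = np.mean(rv_sub)
    m_bar = np.mean([len(y[k::K]) - 1 for k in range(K)])
    return (rv_avg - m_bar / M * rv_full) / (1.0 - m_bar / M)
```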

2.4.3 Numerical implementation and backtesting


We now backtest the proposed technique on the S&P 500 index with the following
choice of subgrids. The full grid is defined by the one-minute data from the opening to
the close of the trading day (9h00 to 17h30). The data cover the period from 1
February 2011 to 6 June 2011. We denote the full grid for each trading day:
\[ G = \{ t_0, \ldots, t_M \} \]
and the subgrids are chosen as:
\[ H^{(k)} = \{ t_{k-1}, t_{k-1+K}, \ldots, t_{k-1+n_k K} \} \]
where the index k = 1, . . . , K and nk is the integer making t_{k−1+n_k K} the last
element of H^(k). As we cannot compute exactly the optimal value K* for each trading
period, we employ an iterative scheme which tends to converge to the optimal value.
The analytical expression of K* is given by Zhang et al.:
\[ K^{\star} = M^{2/3} \left( \frac{12 \left( \mathrm{E}\left[ \epsilon^2 \right] \right)^2}{T\, \mathrm{E}\left[ \eta^2 \right]} \right)^{1/3} \]
where η is given by the expression:
\[ \eta^2 = \int_0^T \sigma_t^4 \, dt \]
As a first approximation, we consider the case where the intraday volatility is
constant, so that the expression of η simplifies to η² = T σ⁴. In Figure 2.29, we present
the result for the intraday volatility of the S&P 500 index, taking into account only the
trading session, under the assumption of constant volatility. The two time-scale
estimator reduces the effect of the microstructure noise on the realized volatility
computed over the full grid.


Figure 2.29: Two-time scale estimator of intraday volatility

[Plot of the intraday volatility σ (%) from 02/11 to 06/11 for the full-grid, subgrid and two-scale estimators]

2.5 Conclusion
Voltarget strategies are efficient ways to control risk when building trading strategies.
Hence, a good estimator of the volatility is essential from this perspective. In this
paper, we show that we can use the data range to improve the forecasting of the
market volatility. The use of high and low prices is less important for the index, as it
gives more or less the same result as the traditional close-to-close estimator. However,
for individual stocks with a higher volatility level, the high-low estimators improve the
prediction of the volatility. We consider several backtests on the S&P 500 index and
obtain competitive results with respect to the traditional moving-average estimator of
the volatility.
We then consider a simple stochastic volatility model which permits integrating the
dynamics of the volatility into the estimator. An optimization scheme via the
maximum likelihood algorithm allows us to obtain the optimal averaging window
dynamically. We also compare these results for the range-based estimators with the
well-known IGARCH model. The comparison between the optimal values of the
likelihood functions for the various estimators also gives us a ranking of the estimation
errors.
Finally, we studied high-frequency volatility estimation, which is a very active topic in
financial mathematics. Using the simple model proposed by Zhang et al. (2005), we
show that the microstructure noise can be eliminated by the two time-scale estimator.

Bibliography

[1] Bandi F. M. and Russell J. R. (2006), Separating Microstructure Noise from Volatility, Journal of Financial Economics, 79, pp. 655-692.

[2] Drost F. C. and Nijman T. E. (1993), Temporal Aggregation of GARCH Processes, Econometrica, 61, pp. 909-927.

[3] Drost F. C. and Werker J. M. (1999), Closing the GARCH Gap: Continuous Time GARCH Modeling, Journal of Econometrics, 74, pp. 31-57.

[4] Feller W. (1951), The Asymptotic Distribution of the Range of Sums of Independent Random Variables, Annals of Mathematical Statistics, 22, pp. 427-432.

[5] Garman M. B. and Klass M. J. (1980), On the Estimation of Security Price Volatilities from Historical Data, Journal of Business, 53, pp. 67-78.

[6] Kunitomo N. (1992), Improving the Parkinson Method of Estimating Security Price Volatilities, Journal of Business, 65, pp. 295-302.

[7] Parkinson M. (1980), The Extreme Value Method for Estimating the Variance of the Rate of Return, Journal of Business, 53, pp. 61-65.

[8] Rogers L. C. G. and Satchell S. E. (1991), Estimating Variance from High, Low and Closing Prices, Annals of Applied Probability, 1, pp. 504-512.

[9] Yang D. and Zhang Q. (2000), Drift-Independent Volatility Estimation Based on High, Low, Open and Close Prices, Journal of Business, 73, pp. 477-491.

[10] Zhang L., Mykland P. A. and Aït-Sahalia Y. (2005), A Tale of Two Time Scales: Determining Integrated Volatility with Noisy High-Frequency Data, Journal of the American Statistical Association, 100(472), pp. 1394-1411.

Chapter 3

Support Vector Machine in Finance

In this chapter, we review the well-known machine learning technique called the
support vector machine (SVM). This technique can be employed in different contexts
such as classification, regression or density estimation according to Vapnik [1998].
Within this paper, we first give an overview of this method and its various numerical
implementations, and then bridge it to financial applications such as stock selection.

Keywords: Machine learning, statistical learning, support vector machine, regression,
classification, stock selection.

3.1 Introduction
Support vector machines form an important part of Statistical Learning Theory. The
technique was first introduced in the early 1990s by Boser et al. (1992) and has
contributed important applications in various domains such as pattern recognition
(for example handwritten digits or images) and bioinformatics. It can be employed in
different contexts such as classification, regression or density estimation according to
Vapnik [1998]. Recently, different applications in the financial field have been
developed along two main directions. The first one employs the SVM as a non-linear
estimator in order to forecast the market tendency or the volatility. In this context,
the SVM is used as a regression technique which can easily be extended to the
non-linear case thanks to the kernel approach. The second direction consists of using
the SVM as a classification technique which aims to perform the stock selection in a
trading strategy (for example a long/short strategy). In this paper, we review the
support vector machine and its applications in finance from both points of view. The
literature of this recent field is quite diversified and divergent, with many approaches
and different techniques. We first give an overview of the SVM from its basic
construction to its extensions, including the multi-classification problem. We then
present different numerical implementations and bridge them to financial applications.


This paper is organized as follows. In Section 2, we recall the framework of support
vector machine theory based on the approach proposed in O. Chapelle (2002). We
next work out various implementations of this technique, from both the primal and
the dual problems, in Section 3. The extension of the SVM to the case of
multi-classification is discussed in Section 4. We finish with the introduction of the
SVM in the financial domain via an example of stock selection in Sections 5 and 6.

3.2 Support vector machine at a glance


We attempt to give an overview of the support vector machine method in this section.
In order to introduce the basic idea of the SVM, we start with a first discussion of the
classification method via the concepts of hard margin and soft margin classification.
As the work pioneered by Vapnik and Chervonenkis (1971) has established a
framework for Statistical Learning Theory, the so-called “VC theory”, we also give a
brief introduction with the basic notation and the important Vapnik-Chervonenkis
theorem for the Empirical Risk Minimization (ERM) principle. The extension of ERM
to Vicinal Risk Minimization (VRM) will also be discussed.

3.2.1 Basic ideas of SVM


We illustrate here the basic ideas of the SVM as a classification method. The main
advantage of the SVM is that it can not only be described very intuitively in the
context of linear classification but can also be extended in an intelligent way to the
non-linear case. Let us define the training data set consisting of pairs of
“input/output” points (xi , yi ), with 1 ≤ i ≤ n. Here the input vector xi belongs to
some space X whereas the output yi belongs to {−1, 1} in the case of bi-classification.
The output yi is used to identify the two possible classes.

Hard margin classification


The simplest idea of linear classification is to look at the whole set of inputs
{xi ⊂ X } and search for a hyperplane which can separate the data into two classes
based on the labels yi = ±1. It consists of constructing a linear discriminant function
of the form:
\[ h(x) = w^{\mathsf{T}} x + b \]
where the vector w is the weight vector and b is called the bias. The hyperplane is
defined by the following equation:
\[ H = \{ x : h(x) = w^{\mathsf{T}} x + b = 0 \} \]
This hyperplane divides the space X into two regions: the region where the
discriminant function takes positive values and the region where it takes negative
values. The hyperplane is also called the decision boundary. The term linear
classification comes from the fact that this boundary depends on the data in a linear
way.


Figure 3.1: Geometric interpretation of the margin in a linear SVM.

We now define the notion of margin. In Figure 3.1 (reprinted from Ben-Hur A. et al.,
2010), we give a geometric interpretation of the margin in a linear SVM. Let x+ and
x− be the closest points to the hyperplane from the positive side and from the
negative side. The circled data points are the support vectors, i.e. the points closest to
the decision boundary (see Figure 3.1). The vector w is the normal vector to the
hyperplane; we denote its norm ‖w‖ = √(wᵀw) and its direction ŵ = w/‖w‖. We
assume that x+ and x− are equidistant from the decision boundary. They determine
the margin by which the two classes of points of the data set D are separated:
\[ m_D(h) = \frac{1}{2}\, \hat{w}^{\mathsf{T}} (x_+ - x_-) \]
Geometrically, this margin is just half the distance between the two closest points
from both sides of the hyperplane H, projected onto the direction ŵ. We use the
equations defining the relative positions of these points with respect to the hyperplane H:
\[ h(x_+) = w^{\mathsf{T}} x_+ + b = a, \qquad h(x_-) = w^{\mathsf{T}} x_- + b = -a \]
where a > 0 is some constant. As the normal vector w and the bias b are determined
only up to a common scaling, we can simply divide them by a and renormalize these
equations. This is equivalent to setting a = 1 in the above expressions and we finally
get:
\[ m_D(h) = \frac{1}{2}\, \hat{w}^{\mathsf{T}} (x_+ - x_-) = \frac{1}{\| w \|} \]


The basic idea of the maximum margin classifier is to determine the hyperplane which
maximizes the margin. For a separable dataset, we can define the hard margin SVM
as the following optimization problem:
\[ \min_{w, b} \; \frac{1}{2} \| w \|^2 \tag{3.1} \]
\[ \text{u.c.} \quad y_i \left( w^{\mathsf{T}} x_i + b \right) \geq 1, \quad i = 1, \ldots, n \]
Here, yi (wᵀxi + b) ≥ 1 is just a compact way to express the relative position of the
two classes of data points with respect to the hyperplane H. In fact, we have
wᵀxi + b ≥ 1 for the class yi = 1 and wᵀxi + b ≤ −1 for the class yi = −1.

The historical approach for solving this quadratic program is to map the primal
problem to its dual problem. We give here the main result, while the detailed
derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives
the following optimal solution (w*, b*):
\[ w^{\star} = \sum_{i=1}^{n} \alpha_i^{\star}\, y_i\, x_i \]
where α* = (α*₁ , . . . , α*ₙ) is the solution of the dual optimization problem with dual
variable α = (α₁ , . . . , αₙ) of dimension n:
\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^{\mathsf{T}} x_j \]
\[ \text{u.c.} \quad \alpha_i \geq 0, \quad i = 1, \ldots, n \]

We remark that the above optimization problem is a quadratic program in the


vectorial space Rd with n linear inequality constraints. It may become meaningless
if it has no solution (the dataset is inseparable) or too many solutions (stability of
boundary decision on data). The questions on the existence of a solution in Prob-
lem 3.5 or on the sensibility of solution on dataset are very difficult. A quantitative
characterization can be found in the next discussion on the framework of Vapnik-
Chervonenskis theory. We will present here an intuitive view of this problem which
depends on two main factors. The first one is the dimension of the space of func-
tion h(x) which determines the decision boundary. In the linear case, it is simply
determined by the dimension of the couple (w, b). If the dimension of this function
space is two small as in the linear case, it is possible that there exists no linear so-
lution or the dataset can not be separated by a simple linear classifier. The second
factor is the number of data points which involves in the optimization program via
n inequality constraints. If the number of constraints is too large, the solution may
not exist neither. In order to overcome this problem we must increase the dimension
of the optimization problem. There exists two possible ways to do this. The first
one consists of relaxing the inequality constrains by introducing additional variables
which aim to tolerate the strict separation. We will allow the separation with cer-
tain error (some data points in the wrong side). This technique is introduced first by


Cortes C. and Vapnik V. (1995) under the name of the "soft margin SVM". The second one consists of using a non-linear classifier, which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which raises a computational issue. An elegant way to get around it is to employ the notion of kernel. In the next discussions, we clarify these two approaches, and we finish this section by introducing two general frameworks of statistical learning theory.

Soft margin classification



In fact, the inequality constraints described above, y_i (w^T x_i + b) ≥ 1, ensure that all data points are well classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive way to overcome this is to relax the strict constraints by introducing additional variables ξ_i, i = 1, ..., n, the so-called slack variables. They allow a certain error in the classification via the new constraints:

y_i (w^T x_i + b) ≥ 1 − ξ_i,   i = 1...n   (3.2)

For ξ_i > 1, the data point x_i is completely misclassified, whereas 0 ≤ ξ_i ≤ 1 can be interpreted as a margin error. With this definition of the slack variables, Σ_{i=1}^n ξ_i is directly related to the number of misclassified points. In order to fix our expected error in the classification problem, we introduce an additional term C Σ_{i=1}^n ξ_i^p in the objective function and rewrite the optimization problem as follows:

min_{w,b,ξ}  (1/2) ‖w‖^2 + C Σ_{i=1}^n ξ_i   (3.3)
u.c.  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,   i = 1...n

Here, C is the parameter used to fix the desired level of error, and p ≥ 1 is the usual way to ensure the convexity of the additional term¹. The soft-margin formulation of the SVM problem can be interpreted as a regularization technique, of the kind one finds in other optimization problems such as regression, filtering or matrix inversion. The same result will be recovered with the regularization technique later, when we discuss the possible use of kernels.

Before switching to the discussion of non-linear classification with the kernel approach, we remark that the soft margin SVM problem now has the higher dimension d + 1 + n. However, the computational cost does not increase. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization over ξ: the primal problem is then no longer a quadratic program, but it can be solved by Newton optimization or conjugate gradient, as demonstrated in Chapelle O. (2007).
¹ It is equivalent to defining an Lp norm on the slack vector ξ ∈ R^n.


Non-linear classification, Kernel approach


The second approach to improve the classification is to employ a non-linear SVM. In the context of SVM, we insist that the construction of the non-linear discriminant function h(x) consists of two steps. We first extend the data space X of dimension d to a feature space F of higher dimension N via a non-linear transformation φ : X → F; a hyperplane is then constructed in the feature space F as presented before:

h(x) = w^T φ(x) + b

Here, the resulting vector z = (z_1, ..., z_N) = φ(x) is an N-component vector of the feature space F, hence w is also a vector of size N. The hyperplane H = {z : w^T z + b = 0} defined in F is no longer a linear decision boundary in the initial space X:

B = {x : w^T φ(x) + b = 0}

At this stage, the generalization to the non-linear case helps us to cope with the problems of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, a quadratic transformation leads to a feature space of dimension N = d(d + 3)/2. The main question is then how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our N-dimensional problem turns again into the following n-dimensional optimization problem with dual variable α:

max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j φ(x_i)^T φ(x_j)
u.c.  α_i ≥ 0,  Σ_{i=1}^n α_i y_i = 0,   i = 1...n

Indeed, the expansion of the optimal solution w⋆ has the following form:

w⋆ = Σ_{i=1}^n α_i⋆ y_i φ(x_i)

In order to solve the quadratic program, we do not need the explicit form of the non-linear map but only the kernel K(x_i, x_j) = φ(x_i)^T φ(x_j), which is usually assumed to be symmetric. Providing only the kernel K(x_i, x_j) to the optimization problem is enough to construct later the hyperplane H in the feature space F, or equivalently the decision boundary in the data space X. Thanks to the expansion of the optimal w⋆ on the initial data x_i, i = 1, ..., n, the discriminant function can be computed as follows:

h(x) = Σ_{i=1}^n α_i y_i K(x, x_i) + b

From this expression, we can construct the decision function used to classify a given input x as f(x) = sign(h(x)).


For a given non-linear function φ(x), we can compute the kernel K(x_i, x_j) via the scalar product of two vectors in the feature space F. However, the converse does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we consider some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: K(x, y) = (x^T y + 1)^p

ii. Radial Basis kernel: K(x, y) = exp(−‖x − y‖^2 / 2σ^2)

iii. Neural Network kernel: K(x, y) = tanh(a x^T y − b)
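As an illustration, here is a minimal Python sketch of these three kernels; the parameter values (p, sigma, a, b) are arbitrary defaults chosen for the example and are not prescribed by the text.

import numpy as np

# Minimal sketch of the three standard kernels listed above.
def polynomial_kernel(x, y, p=2):
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2))

def neural_network_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * np.dot(x, y) - b)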

3.2.2 ERM and VRM frameworks


We finish the review on SVM by discussing briefly the general framework of statistical learning theory, of which SVM is a part. Without entering into details such as the important theorem of Vapnik and Chervonenkis (1998), we would like to give a more general view of SVM by answering questions such as how to approach SVM as a regression, or how to interpret the soft-margin SVM as a regularization technique.

Empirical Risk Minimization framework


The Empirical Risk Minimization (ERM) framework was studied by Vapnik and Chervonenkis in the 1970s. In order to present the main idea, we first fix some notations. Let (x_i, y_i), 1 ≤ i ≤ n, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution P(x, y). The dependency between the input x and the output y is characterized by this distribution. For example, if the input x has a distribution P(x) and the output is related to x via a function y = f(x) altered by a Gaussian noise N(0, σ^2), then P(x, y) reads:

P(x, y) = P(x) N(f(x) − y; 0, σ^2)

We remark in this example that if σ → 0 then N(0, σ^2) tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the position of the maximum of the distribution P(x, y). Estimating the function f(x) is fundamental. In order to measure the quality of the estimation, we compute the expected value of a loss function with respect to the distribution P(x, y). We define here the loss function in two different contexts:

1. Classification: l(f(x), y) = I_{f(x) ≠ y}, where I is the indicator function.

2. Regression: l(f(x), y) = (f(x) − y)^2

The objective of statistical learning is to determine the function f, in a certain function space F, which minimizes the expected loss or risk functional:

R(f) = ∫ l(f(x), y) dP(x, y)


As the distribution P(x, y) is unknown, the expected loss cannot be evaluated. However, with the available training dataset {x_i, y_i}, one can compute the empirical risk as follows:

R_emp(f) = (1/n) Σ_{i=1}^n l(f(x_i), y_i)

In the limit of a large dataset n → ∞, we expect the convergence R_emp(f) → R(f) for every tested function f, thanks to the law of large numbers. However, is the learning function f which minimizes R_emp(f) the one minimizing the true risk R(f)? The answer to this question is no. In general, there is an infinite number of functions f which can learn the training dataset perfectly, i.e. f(x_i) = y_i ∀i. In fact, we have to restrict the function space F in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a function space F was first studied in the VC theory via the concept of VC dimension (1971) and the important VC theorem, which gives an upper bound on the probability P{sup_{f∈F} |R(f) − R_emp(f)| > ε} → 0.

A common way to restrict the function space is to impose a regularization condition. Denoting by Ω(f) a measure of regularity, the regularized problem consists of minimizing the regularized risk:

R_reg(f) = R_emp(f) + λ Ω(f)

Here λ is the regularization parameter and Ω(f) can be, for example, an Lp norm of some deviation of f.

Vapnik and Chervonenkis theory


We are not going to discuss the VC theory of statistical learning machines in detail, but only recall the most important results concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable-data problem, Vapnik and Chervonenkis introduced a very important concept, the VC dimension, together with an important theorem which characterizes the convergence of the empirical risk function. First, the VC dimension is introduced to measure the complexity of the class of functions F.

Definition 3.2.1 The VC dimension of a class of functions F is defined as the maximum number of points that can be exactly learned by a function of F:

h = max{ |X| : X ⊂ X, such that ∀b ∈ {−1, 1}^{|X|}, ∃f ∈ F, ∀x_i ∈ X, f(x_i) = b_i }   (3.4)

With the definition of the VC dimension, we now present the VC theorems, which are a very powerful tool to control the upper bound on the convergence of the empirical risk to the true risk function. These theorems allow us to have a clear idea of the relation between the available information and the number of observations n in the training set. Using these theorems, we can control the trade-off between overfitting and underfitting. The relation between the factors, i.e. the coordinates of the vector x, and the VC dimension is given in the following theorem:


Theorem 3.2.2 (VC theorem of hyperplanes) Let F be the set of hyperplanes in R^d:

F = { x ↦ sign(w^T x + b), w ∈ R^d, b ∈ R }

Then the VC dimension of F is d + 1.

This theorem gives the explicit relation between the VC dimension and the number of factors, i.e. the number of coordinates of the input vectors of the training set. It is used in the next theorem in order to evaluate the amount of information necessary for a good classification or regression.

Theorem 3.2.3 (Vapnik and Chervonenkis) Let F be a class of functions of VC dimension h. Then, for any distribution Pr and for any sample {(x_i, y_i)}_{i=1...n} drawn from this distribution, the following inequality holds:

Pr{ sup_{f∈F} |R(f) − R_emp(f)| > ε } < 4 exp{ h (1 + ln(2n/h)) − (ε − 1/n)^2 n }

An important corollary of the VC theorem is an upper bound on the gap between the empirical risk and the true risk:

Corollary 3.2.4 Under the same hypotheses as the VC theorem, the following inequality holds with probability 1 − η:

∀f ∈ F,  R(f) − R_emp(f) ≤ sqrt( [ h (ln(2n/h) + 1) − ln(η/4) ] / n ) + 1/n

We skip the proofs of these theorems and postpone the discussion of the practical importance of the VC theorems to Section 6, as the overfitting and underfitting problems are very present in financial applications.

Vicinal Risk Minimization framework


The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle O. (2000s). In the ERM framework, the risk is evaluated by using the empirical probability distribution:

dP_emp(x, y) = (1/n) Σ_{i=1}^n δ_{x_i}(x) δ_{y_i}(y)

where δ_{x_i}(x) and δ_{y_i}(y) are Dirac distributions located at x_i and y_i respectively. In the VRM framework, the Dirac distribution in dP_emp is replaced by an estimated density in the vicinity of x_i:

dP_vic(x, y) = (1/n) Σ_{i=1}^n dP_{x_i}(x) δ_{y_i}(y)


Hence, the vicinal risk is defined as follows:

R_vic(f) = ∫ l(f(x), y) dP_vic(x, y) = (1/n) Σ_{i=1}^n ∫ l(f(x), y_i) dP_{x_i}(x)

In order to illustrate the difference between the ERM and VRM frameworks, let us consider the following example of linear regression. In this case, the loss function is l(f(x), y) = (f(x) − y)^2, where the learning function is of the form f(x) = w^T x + b. Assume that the vicinal probability density dP_{x_i}(x) is approximated by a white noise of variance σ^2. The vicinal risk is calculated as follows:

R_vic(f) = (1/n) Σ_{i=1}^n ∫ (f(x) − y_i)^2 dP_{x_i}(x)
         = (1/n) Σ_{i=1}^n ∫ (f(x_i + ε) − y_i)^2 dN(0, σ^2)
         = (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2 + σ^2 ‖w‖^2

This is equivalent to the regularized risk minimization problem R_vic(f) = R_emp(f) + σ^2 ‖w‖^2, i.e. an L2 penalty with parameter σ^2.

3.3 Numerical implementations


In this section, we discuss explicitly the two possible ways to implement the SVM algorithm. As discussed above, the kernel approach can be applied directly to the dual problem and leads to a simple quadratic program. We discuss the dual approach first, for historical reasons. A direct implementation of the primal problem is a little more delicate, which is why it was implemented much later by Chapelle O. (2007) using the Newton optimization method and the conjugate gradient method. According to Chapelle O., in terms of complexity both approaches offer more or less the same efficiency, while in some contexts the latter gives some advantage in terms of solution precision.

3.3.1 Dual approach


We discuss here in more detail the two main applications of SVM, namely the classification problem and the regression problem, within the dual approach. The reason for the historical choice of this approach is simply that it offers the possibility of obtaining a standard quadratic program whose numerical implementation is well established. Here, we summarize the results presented in Cortes C. and Vapnik V. (1995), where the notion of soft-margin SVM was introduced. We next discuss the extension to regression.


Classification problem
As introduced in the last section, the classification task encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result is very sensitive to the input, so that a small change in the data can cause an instability in the final result. The second problem concerns non-separable data, in the sense that the function space is too small and we cannot obtain a solution which minimizes the risk function. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some conditions and working with a specific function class (the linear case, for example). In the latter case, one needs to extend the function space by introducing some tolerated error (the soft-margin approach) or by working with a non-linear transformation.
a) Linear SVM with soft-margin approach
In their work, Cortes C. and Vapnik V. (1995) first introduced the notion of soft margin by accepting that there will be some error in the classification. They characterize this error by additional variables ξ_i associated with each data point x_i. These parameters enter the classification via the constraints. For a given hyperplane, the constraint y_i (w^T x_i + b) ≥ 1 means that the point x_i is well classified and lies outside the margin. When we change this condition to y_i (w^T x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0, i = 1...n, it first allows the point x_i to be well classified but inside the margin, for 0 ≤ ξ_i < 1. For values ξ_i > 1, the input x_i may be misclassified. As written above, the primal problem becomes an optimization over both the margin and the total committed error:

min_{w,b,ξ}  (1/2) ‖w‖^2 + C F( Σ_{i=1}^n ξ_i^p )
u.c.  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,   i = 1...n

Here, p is the degree of regularization. We remark that only for the choice p ≥ 1 does the soft margin have a unique solution. The function F(u) is usually chosen as a convex function with F(0) = 0, for example F(u) = u^k. In the following, we consider two specific cases: (i) the hard-margin limit with C = 0; (ii) the L1 penalty with F(u) = u, p = 1. We define the dual vector Λ = (α_1, ..., α_n) and the output vector y = (y_1, ..., y_n). In order to write the optimization problem in vector form, we also define the operator D = (D_ij)_{n×n} with D_ij = y_i y_j x_i^T x_j.
i. Hard-margin limit with C = 0. As shown in Appendix C.1.1, this problem can be mapped to the following dual problem:

   max_Λ  Λ^T 1 − (1/2) Λ^T D Λ   (3.5)
   u.c.  Λ^T y = 0,  Λ ≥ 0


ii. L1 penalty with F(u) = u, p = 1. In this case, the associated dual problem is given by:

   max_Λ  Λ^T 1 − (1/2) Λ^T D Λ   (3.6)
   u.c.  Λ^T y = 0,  0 ≤ Λ ≤ C1

   The full derivation is given in Appendix C.1.2. A numerical sketch of this quadratic program is given after item b) below.

Remark 2 For the case of the L2 penalty (F(u) = u, p = 2), we will show in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case, with an additional regularization term 1/(2C) added to the matrix D:

max_Λ  Λ^T 1 − (1/2) Λ^T ( D + (1/2C) I ) Λ   (3.7)
u.c.  Λ^T y = 0,  Λ ≥ 0

b) Non-linear SVM with kernel approach
The second possibility to extend the function space is to employ a non-linear transformation φ(x) from the initial space X to the feature space F, and then to construct the hard-margin problem. This approach leads to the same dual problem as above, with an explicit kernel K(x_i, x_j) = φ(x_i)^T φ(x_j) in place of x_i^T x_j. In this case, the D operator is the matrix D = (D_ij)_{n×n} with elements:

D_ij = y_i y_j K(x_i, x_j)

With this convention, the first two quadratic programs above can be rewritten in the context of non-linear classification by replacing the D operator by this new kernel-based definition.

We finally remark that the case of the soft-margin SVM with quadratic penalty (F(u) = u, p = 2) can also be seen as a hard-margin SVM with a modified kernel. We introduce the new transformation φ̃(x_i) = (φ(x_i), 0, ..., y_i/sqrt(2C), ..., 0), where the element y_i/sqrt(2C) is at position i + dim(φ(x_i)), and the new vector w̃ = (w, sqrt(2C) ξ_1, ..., sqrt(2C) ξ_n). In this new representation, the objective function ‖w‖^2/2 + C Σ_{i=1}^n ξ_i^2 becomes simply ‖w̃‖^2/2, whereas the inequality constraint y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i becomes y_i (w̃^T φ̃(x_i) + b) ≥ 1. Hence, we obtain a hard-margin SVM with a modified kernel which can be computed simply as:

K̃(x_i, x_j) = φ̃(x_i)^T φ̃(x_j) = K(x_i, x_j) + δ_ij/(2C)

This kernel is consistent with the QP program in the last remark.
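As announced in item ii, here is a minimal sketch of how the soft-margin dual (3.6) could be solved numerically. It uses a generic constrained optimizer (SciPy's SLSQP) rather than a dedicated QP solver, and the linear Gram matrix D may equally be built from any kernel as described in item b); the helper name and its defaults are assumptions of this sketch.

import numpy as np
from scipy.optimize import minimize

# Sketch: solve max_L  L^T 1 - (1/2) L^T D L  u.c.  L^T y = 0, 0 <= L <= C,
# with D_ij = y_i y_j x_i^T x_j (soft-margin dual (3.6)).
def soft_margin_dual(X, y, C=1.0):
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    D = (y[:, None] * X) @ (y[:, None] * X).T
    objective = lambda a: 0.5 * a @ D @ a - a.sum()        # minimize the negative dual
    gradient = lambda a: D @ a - np.ones(n)
    constraint = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    res = minimize(objective, np.zeros(n), jac=gradient, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=[constraint])
    return res.x                                            # dual variables alpha_i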


In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we therefore study the SVM problem only for the two cases of hard and soft margin within the kernel approach. After obtaining the optimal vector Λ⋆ by solving the associated QP program described above, we can compute b⋆ from the KKT conditions and then derive the decision function f(x). We recall that w⋆ = Σ_{i=1}^n α_i⋆ y_i φ(x_i).

i. For the hard-margin case, the KKT condition given in Appendix C.1.1 is:

   α_i⋆ ( y_i (w⋆^T φ(x_i) + b⋆) − 1 ) = 0

   We notice that for α_i > 0, the inequality constraint becomes an equality. As the inequality constraint becomes an equality constraint, these points are the closest points to the optimal frontier and they are called support vectors. Hence, b⋆ can be computed easily from any given support vector (x_i, y_i) as:

   b⋆ = y_i − w⋆^T φ(x_i)

   In order to enhance the precision of b⋆, we evaluate this value as the average over the set SV of support vectors:

   b⋆ = (1/n_SV) Σ_{i∈SV} ( y_i − Σ_{j∈SV} α_j⋆ y_j K(x_i, x_j) )

ii. For the soft-margin case, the KKT condition given in Appendix C.1.2 is slightly different:

   α_i⋆ ( y_i (w⋆^T φ(x_i) + b⋆) − 1 + ξ_i ) = 0

   However, if α_i satisfies the condition 0 < α_i < C, then we can show that ξ_i = 0. This condition defines the subset of training points (the support vectors) which are closest to the separating frontier. Hence, b⋆ can be computed by exactly the same expression as in the hard-margin case.

From the optimal values of the triple (Λ⋆, w⋆, b⋆), we can construct the decision function used to classify a given input x as follows:

f(x) = sign( Σ_{i=1}^n α_i⋆ y_i K(x, x_i) + b⋆ )   (3.8)

Regression problem
In the previous sections, we discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM

or VRM), the SVM problem consists of minimizing the risk function R_emp or R_vic. The risk function can be computed via the loss function l(f(x), y), which defines our objective (classification or regression). Explicitly, the risk function is calculated as:

R(f) = ∫ l(f(x), y) dP(x, y)

where the distribution dP(x, y) is specified in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as l(f(x), y) = I_{f(x) ≠ y}, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for classification can then be mapped to the maximization of the margin 1/‖w‖. For the regression problem, the loss function is l(f(x), y) = (f(x) − y)^2, which means that the loss is the regression error.

Remark 3 We have chosen here the least-square error as the loss only for illustration. In general, it can be replaced by any positive function F of f(x) − y, so that the loss function takes the general form l(f(x), y) = F(f(x) − y). We remark that the least-square case corresponds to the L2 norm, so that the simplest generalization is to take the loss function as an Lp norm, l(f(x), y) = |f(x) − y|^p. We show later that the special case of the L1 norm brings the regression problem to a form similar to the soft-margin classification.

In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case, where the training vector x is already transformed by a non-linear map φ(x). The approximating function of the regression therefore reads f(x) = w^T φ(x) + b. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:

R_emp = (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2

whereas in the VRM framework, if we assume that dP_{x_i}(x) is a Gaussian noise of variance σ^2, the risk function reads:

R_vic = (1/n) Σ_{i=1}^n (f(x_i) − y_i)^2 + σ^2 ‖w‖^2

The risk function in the VRM framework can be interpreted as a regularized form of the risk function in the ERM framework. We rewrite the risk function after renormalizing it by the factor 2σ^2:

R_vic = (1/2) ‖w‖^2 + C Σ_{i=1}^n ξ_i^2

with C = 1/(2σ^2 n). Here, we have introduced the new variables ξ = (ξ_i)_{i=1...n} which satisfy y_i = f(x_i) + ξ_i = w^T φ(x_i) + b + ξ_i. The regression problem can now be


written as a QP program with equality constraints as follows:

min_{w,b,ξ}  (1/2) ‖w‖^2 + C Σ_{i=1}^n ξ_i^2
u.c.  y_i = w^T φ(x_i) + b + ξ_i,   i = 1...n

In the present form, the regression looks very similar to the SVM problem for classification. We notice that the regression problem in the context of SVM can easily be generalized in two possible ways:

• The first way is to introduce a more general loss function F(f(x_i) − y_i) instead of the least-square loss. This generalization can lead to other types of regression such as the ε-SV regression proposed by Vapnik (1998).

• The second way is to introduce a weight distribution ω_i for the empirical distribution instead of the uniform distribution:

  dP_emp(x, y) = Σ_{i=1}^n ω_i δ_{x_i}(x) δ_{y_i}(y)

  As financial quantities depend more on the recent past, an asymmetric weight distribution in favor of recent data should improve the estimator. The idea of this generalization is quite similar to the exponential moving average. By doing this, we recover the results obtained in Gestel T.V. et al. (2001) and in Tay F.E.H. and Cao L.J. (2002) for the LS-SVM formalism. For example, we can choose the weight distribution proposed in Tay F.E.H. and Cao L.J. (2002): ω_i = 2i/(n(n + 1)) (linear distribution) or ω_i = 1/(1 + exp(a − 2ai/n)) (exponential weight distribution).

Our least-square regression problem can again be mapped to a dual problem after introducing the Lagrangian. Detailed calculations are given in Appendix C.1. We give here the principal result, which again involves the kernel K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j) for treating the non-linearity. As in the classification case, we consider only two problems, which are analogous to the hard margin and the soft margin in the context of regression.

i. Least-square SVM regression: In fact, the regression problem discussed above is similar to the hard-margin problem. Here, we have to keep the regularization parameter C, as it defines the error tolerance of the regression. Moreover, this problem with the L2 constraint is equivalent to a hard margin with a modified kernel. The quadratic optimization program is given as follows:

   max_Λ  Λ^T y − (1/2) Λ^T ( K + (1/2C) I ) Λ   (3.9)
   u.c.  Λ^T 1 = 0

   A short numerical sketch of (3.9) is given after this enumeration.


ii. ε-SVM regression: The ε-SVM regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function of the following form:

   l(f(x), y) = ( |y − f(x)| − ε ) I_{|y−f(x)| ≥ ε}

   The ε-SVM loss function is just a generalization of the L1 error. Here, ε is an additional tolerance parameter which allows us not to count regression errors smaller than ε. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:

   R_vic = (1/2) ‖w‖^2 + C Σ_{i=1}^n ( |f(x_i) − y_i| − ε ) I_{|y_i−f(x_i)| ≥ ε}

   The two sets {y_i − f(x_i) ≥ ε} and {y_i − f(x_i) ≤ −ε} are disjoint, so we can break the indicator function I_{|y_i−f(x_i)| ≥ ε} into two terms:

   I_{|y_i−f(x_i)| ≥ ε} = I_{y_i−f(x_i)−ε ≥ 0} + I_{f(x_i)−y_i−ε ≥ 0}

   We introduce the slack variables ξ and ξ', as in the previous case, which satisfy the conditions ξ_i ≥ f(x_i) − y_i − ε and ξ'_i ≥ y_i − f(x_i) − ε. Hence, we obtain the following optimization problem:

   min_{w,b,ξ,ξ'}  (1/2) ‖w‖^2 + C Σ_{i=1}^n ( ξ_i + ξ'_i )
   u.c.  w^T φ(x_i) + b − y_i ≤ ε + ξ_i,  ξ_i ≥ 0,   i = 1...n
         y_i − w^T φ(x_i) − b ≤ ε + ξ'_i,  ξ'_i ≥ 0,   i = 1...n

Remark 4 We remark that our approach gives exactly the same result as the traditional approach discussed in the work of Vapnik (1998), in which the objective function combines the margin term with additional terms defining the regression error; these terms are controlled by the pair of slack variables.

The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:

max_{Λ,Λ'}  (Λ − Λ')^T y − ε (Λ + Λ')^T 1 − (1/2) (Λ − Λ')^T K (Λ − Λ')   (3.10)
u.c.  (Λ − Λ')^T 1 = 0,  0 ≤ Λ, Λ' ≤ C1

For the particular case ε = 0, we obtain:

max_Λ  Λ^T y − (1/2) Λ^T K Λ
u.c.  Λ^T 1 = 0,  |Λ| ≤ C1
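As announced in item i, here is a minimal sketch for the least-square SVM regression (3.9). Since (3.9) carries only the equality constraint Λ^T 1 = 0, its KKT conditions reduce to a single linear system in (Λ, b), which the sketch solves directly; this is an assumed solution method for illustration, not the thesis's own implementation.

import numpy as np

# Sketch: the KKT conditions of the LS-SVM dual (3.9) form the linear system
#   [[K + I/(2C), 1], [1^T, 0]] [Lambda; b] = [y; 0].
def ls_svm_regression(K, y, C=1.0):
    K, y = np.asarray(K, float), np.asarray(y, float)
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K + np.eye(n) / (2.0 * C)
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    sol = np.linalg.solve(A, np.append(y, 0.0))
    return sol[:n], sol[n]        # dual variables Lambda* and bias b*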


After the optimization procedure using the QP program, we obtain the optimal vector Λ⋆ and then compute b⋆ from the KKT condition:

w^T φ(x_i) + b − y_i = 0

for support vectors (x_i, y_i) (see Appendix C.1.3 for more detail). In order to have a good accuracy for the estimation of b⋆, we average over the set SV of support vectors and obtain:

b⋆ = (1/n_SV) Σ_{i∈SV} ( y_i − Σ_{j=1}^n α_j⋆ K(x_i, x_j) )

The SVM regressor is then given by the following formula:

f(x) = Σ_{i=1}^n α_i⋆ K(x, x_i) + b⋆

3.3.2 Primal approach


We now discuss the possibility of a direct implementation of the primal problem. This approach was proposed and studied by Chapelle O. (2007). In this work, the author argued that both primal and dual implementations have the same complexity, of order O(max(n, d) min(n, d)^2). Moreover, according to the author, the primal problem may give a more accurate solution as it treats directly the quantity that one is interested in. This can easily be understood via the special case of an LS-SVM linear estimator, where both the primal and the dual problems can be solved analytically.

The main idea of the primal implementation is to rewrite the constrained optimization problem as an unconstrained problem by performing a trivial minimization over the slack variables ξ. We then obtain:

min_{w,b}  (1/2) ‖w‖^2 + C Σ_{i=1}^n L(y_i, w^T φ(x_i) + b)   (3.11)

Here, L(y, t) = (y − t)^p for the regression problem, whereas L(y, t) = max(0, 1 − yt)^p for the classification problem. In the case of a quadratic loss or L2 penalty, the function L(y, t) is differentiable with respect to its second variable, hence one can write the zero-gradient equation. In the case where L(y, t) is not differentiable, such as L(y, t) = max(0, 1 − yt), we have to approximate it by a regular function. Assuming that L(y, t) is differentiable with respect to t, we obtain:

w + C Σ_{i=1}^n ∂L/∂t (y_i, w^T φ(x_i) + b) φ(x_i) = 0

which leads to the following representation of the solution w:

w = Σ_{i=1}^n β_i φ(x_i)


By introducing the kernel K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j), we rewrite the primal problem as follows:

min_{β,b}  (1/2) β^T K β + C Σ_{i=1}^n L(y_i, K_i^T β + b)   (3.12)

where K_i is the i-th column of the matrix K. We note that this is now an unconstrained optimization problem which can be solved by gradient descent whenever L(y, t) is differentiable. In Appendix C.1, we present the detailed derivation of the primal implementation for the cases of the quadratic loss and of the soft-margin classification.
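As an illustration of (3.12) with the quadratic loss L(y, t) = (y − t)^2, here is a minimal sketch using a generic gradient-based optimizer; the Newton / conjugate gradient implementation followed in the thesis (after Chapelle) is not reproduced here, so this is only an assumed alternative for illustration.

import numpy as np
from scipy.optimize import minimize

# Sketch: unconstrained primal problem (3.12) with quadratic loss,
# optimized over (beta, b) by L-BFGS-B (gradients estimated numerically).
def primal_svm_quadratic_loss(K, y, C=1.0):
    K, y = np.asarray(K, float), np.asarray(y, float)
    n = len(y)
    def objective(theta):
        beta, b = theta[:n], theta[n]
        residual = y - (K @ beta + b)
        return 0.5 * beta @ K @ beta + C * residual @ residual
    res = minimize(objective, np.zeros(n + 1), method="L-BFGS-B")
    return res.x[:n], res.x[n]    # beta* and b*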

3.3.3 Model selection - Cross validation procedure


The possibility of enlarging or restricting the function space gives us the possibility of obtaining a solution to the SVM problem. However, the choice of the additional parameters, such as the error tolerance C in the soft-margin SVM or the kernel parameter in the extension to the non-linear case, is fundamental. How can we choose these parameters for a given dataset? In this section, we discuss the calibration procedure, the so-called "model selection", which aims to determine the set of parameters for the SVM. This discussion is essentially based on the results presented in O. Chapelle's thesis (2002).

In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. In the case where we have a lot of data, we can follow the traditional cross-validation procedure by dividing the total data into two independent sets: the training set and the validation set. The training set {x_i, y_i}_{1≤i≤n} is used for the optimization problem, whereas the validation set {x'_i, y'_i}_{1≤i≤m} is used to evaluate the error via the following test function:

T = (1/m) Σ_{i=1}^m ψ(−y'_i f(x'_i))

where ψ(x) = I_{x>0}, with I_A the standard notation for the indicator function. In the case where we do not have enough data for the SVM problem, we can employ the training set directly to evaluate the error via the "leave-one-out" error. Let f^0 be the classifier obtained from the full training set and f^p the one obtained with the point (x_p, y_p) left out. The error is defined by testing the decision rule f^p on the missing point (x_p, y_p):

T = (1/n) Σ_{p=1}^n ψ(−y_p f^p(x_p))

We focus here on the first test error function, with an available validation dataset. However, the error function involves the step function ψ, which is discontinuous and can cause some difficulty if we want to determine the best selection parameters via the optimal test error. In order to perform the search for the minimal test error by gradient


descent, for example, we should smooth the test error by regularizing the step function as:

ψ̃(x) = 1 / (1 + exp(−Ax + B))

The choice of the parameters A and B is important: if A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
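A minimal sketch of this smoothed validation error is given below; `f` is any trained decision function, and the default values of A and B are illustrative assumptions.

import numpy as np

# Sketch: smoothed test error on the validation set, replacing the step
# function psi by the sigmoid psi_tilde(x) = 1 / (1 + exp(-A x + B)).
def smoothed_test_error(f, X_valid, y_valid, A=10.0, B=0.0):
    margins = np.array([-yi * f(xi) for xi, yi in zip(X_valid, y_valid)])
    return float(np.mean(1.0 / (1.0 + np.exp(-A * margins + B))))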

3.4 Extension to SVM multi-classification


The single SVM classification (binary classification) discussed in the last section is very well established and has become a standard method for various applications. However, the extension to the multi-classification problem is not straightforward, and it remains a very active research topic in the pattern recognition domain. In this section, we give a quick overview of this progressing field and of some practical implementations.

3.4.1 Basic idea of multi-classification


The multiclass SVM can be formulated as follows. Let (x_i, y_i)_{i=1...n} be the training dataset with characteristics x ∈ R^d and classification criterion y. For example, the training data belong to m different classes labeled from 1 to m, which means that y ∈ {1, ..., m}. Our task is to determine a classification rule F : R^d → {1, ..., m} based on the training data, which aims to predict to which class a test point x_t belongs by evaluating the decision rule F(x_t).

Recently, many important contributions have advanced the field both in accuracy and in complexity (i.e. reduction of computation time). The extensions have been developed along two main directions. The first one consists of dividing the multi-classification problem into many binary classification problems by using the "one-against-all" or the "one-against-one" strategy. The next step is to construct the decision function in the recognition phase. The implementation of the decision for the "one-against-all" strategy is based on the maximum output among all binary SVMs. The outputs are usually mapped into estimated probabilities, as proposed by different authors such as Platt (1999). For the "one-against-one" strategy, in order to take the right decision, the Max Wins algorithm is adopted: the resulting class is the one voted for by the majority of the binary classifiers. Both techniques face limitations in terms of complexity and high computation time. Another improvement in the same direction, the binary decision tree (SVM-BDT), was recently proposed by Madzarov G. et al. (2009); this technique proved able to speed up the computation time. The second direction consists of generalizing the kernel concept in the SVM algorithm to a more general form. This method treats the multi-classification problem directly by writing a general form of the large margin problem, which is again mapped into the dual problem by incorporating the kernel concept.


Crammer K. and Singer Y. (2001) introduced an efficient algorithm which decomposes the dual problem into multiple optimization problems that can then be solved by a fixed-point algorithm.

3.4.2 Implementations of multiclass SVM


We describe here the two principal implementations of SVM for the multi-classification problem. The first one is a direct application of the binary SVM classifier; however, the recognition phase requires a careful choice of decision strategy. We next describe and implement the multiclass kernel-based SVM algorithm, which is a more elegant approach.

Remark 5 Before discussing the details of the two implementations, we remark that there exist other implementations of SVM, such as the application of Nonnegative Matrix Factorization (Poluru V. K. et al., 2009) to the binary case by rewriting the SVM problem in the NMF framework. Extending this application to the multi-classification case would be an interesting topic for future work.

Decomposition into multiple binary SVMs

The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all strategy and the one-against-one strategy. Recently, another technique utilizing a binary decision tree was proposed; it requires less effort in training and is much faster in the recognition phase, with a complexity of order O(log2 N). All these techniques directly employ the SVM implementation described above.

a) One-against-all strategy: In this case, we construct m single SVM classifiers in order to separate the training data of each class from the remaining classes. Consider the construction of the classifier separating class k from the rest. We start by attributing the response z_i = 1 if y_i = k and z_i = −1 for all y_i ∈ {1, ..., m} \ {k}. Applying this construction for all classes, we finally obtain the m classifiers f_1(x), ..., f_m(x). For a test point x, the decision rule is obtained from the maximum of the outputs given by these m classifiers:

   y = argmax_{k∈{1...m}} f_k(x)

   In order to avoid the error coming from the fact that we compare outputs corresponding to different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):

   P̂r(ω_k | f_k(x)) = 1 / (1 + exp(A_k f_k(x) + B_k))

   where ω_k is the label of the k-th class. This quantity can be interpreted as a measure of the acceptance probability of the class ω_k for a given point x with


   output f_k(x). However, nothing guarantees that Σ_{k=1}^m P̂r(ω_k | f_k(x)) = 1, hence we have to renormalize this probability:

   P̂r(ω_k | x) = P̂r(ω_k | f_k(x)) / Σ_{j=1}^m P̂r(ω_j | f_j(x))

   In order to obtain these probabilities, we have to calibrate the parameters (A_k, B_k). This can be done by maximum likelihood on the training set (Platt, 1999). A minimal sketch of this decision rule is given after this list.
b) One-against-one strategy: Another way to employ the binary SVM classifier is to construct N_c = m(m − 1)/2 binary classifiers which separate all pairs of classes (ω_i, ω_j). We denote the ensemble of classifiers by C = {f_1, ..., f_{N_c}}. In the recognition phase, we evaluate all the outputs f_1(x), ..., f_{N_c}(x) over C for a given point x. These outputs are mapped to the response function of each classifier, sign f_k(x), which determines to which class the point x belongs with respect to the classifier f_k. We denote by N_1, ..., N_m the numbers of times that the point x is classified in the classes ω_1, ..., ω_m respectively. Using these responses, we can construct a probability distribution P̂r(ω_k | x) over the set of classes {ω_k}, which is again used to decide the class of x.
c) Binary decision tree: Both methods above are quite easy to implement as they directly employ the binary solver. However, they both suffer from a high computational cost. We now discuss the technique recently proposed by Madzarov G. et al. (2009), which uses a binary decision tree strategy. Thanks to the binary tree, the technique gains both in complexity and in computation time. It needs only m − 1 classifiers, which do not always run on the whole training set when constructing the classifiers. By construction, recognizing a test point x requires only O(log2 N) evaluations by descending the tree. Figure 3.2 illustrates how this algorithm works for classifying 7 classes.
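Here is the minimal sketch of the one-against-all decision rule announced in item a): the m classifier outputs are mapped through the Platt sigmoids, renormalized, and the class with the largest probability is returned. The calibration of (A_k, B_k) and the list of classifiers are assumed to be available from a separate training step.

import numpy as np

# Sketch of the one-against-all decision rule with Platt-normalized
# probabilities; `classifiers` is a list of the m functions f_k, and
# A, B are the calibrated sigmoid parameters (arrays of length m).
def one_against_all_predict(x, classifiers, A, B):
    scores = np.array([f_k(x) for f_k in classifiers])                 # f_k(x)
    p = 1.0 / (1.0 + np.exp(np.asarray(A) * scores + np.asarray(B)))   # Platt mapping
    p = p / p.sum()                                                    # renormalization
    return int(np.argmax(p)) + 1                                       # classes labeled 1..m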

Multiclass Kernel-based Vector Machines


A more general and elegant formalism can be obtained for multi-classification by generalizing the kernel concept. In this discussion, we follow the approach given in the work of Crammer K. and Singer Y. (2001), but with a more geometrical explanation. We show that this approach can be interpreted as a simultaneous combination of the "one-against-all" and "one-against-one" strategies.

As in the linear case, we have to define a decision function. In the binary case, f(x) = sign(h(x)) where h(x) defines the boundary (i.e. f(x) = +1 if x belongs to class 1 whereas f(x) = −1 if x belongs to class 2). In the multiclass case, the decision function must also indicate the class index. In the work of Crammer K. and Singer Y. (2001), the decision rule F : R^d → {1, ..., m} is constructed as follows:

F(x) = argmax_{k∈{1,...,m}} W_k^T x

Figure 3.2: Binary decision tree strategy for multiclassification problem

Here, W is the d × m weight matrix in which each column W_k is a d × 1 weight vector. Therefore, we can write the weight matrix as W = (W_1 W_2 ... W_m). We recall that the vector x is of dimension d. In fact, the vector W_k corresponding to the k-th class can be interpreted as the normal vector of the hyperplane in the binary SVM. It characterizes the sensitivity of a given point x to the k-th class, and the quantity W_k^T x is similar to a "score" attributed to the class ω_k.

Remark 6 This construction looks quite similar to the "one-against-all" strategy. The main difference is that in the "one-against-all" strategy all vectors W_1, ..., W_m are constructed independently, one by one, with binary SVMs, whereas within this formalism they are constructed simultaneously, all together. We show in the following that the selection rule of this approach is closer to the "one-against-one" strategy.

Remark 7 In order to have an intuitive geometric interpretation, we treat here the case of the linear classifier. However, the generalization to the non-linear case is straightforward once we replace x_i^T x_j by φ(x_i)^T φ(x_j). This step introduces the notion of kernel K(x_i, x_j) = φ(x_i)^T φ(x_j).

By definition, W_k is the vector defining the boundary which distinguishes the class ω_k from the rest. It is a normal vector to the boundary pointing to the region occupied by class ω_k. Assume that we are able to separate all the data correctly with the classifier W. For any point (x, y), when we compute the position of x with respect to the two classes ω_y and ω_k for all k ≠ y, we must find that x belongs to class ω_y. As W_k defines the vector pointing to the class ω_k, when we compare a class ω_y to a class ω_k it is natural to introduce the vector W_y − W_k, which points to class ω_y but not ω_k; as a consequence, W_k − W_y points to class ω_k but not ω_y. When x is well classified, we must have (W_y^T − W_k^T) x > 0 (i.e. the class ω_y has


the best score). In order to have a margin as in the binary case, we impose strictly that (W_y^T − W_k^T) x ≥ 1 ∀k ≠ y. This condition can be written for all k = 1...m by adding δ_{y,k} (the Kronecker symbol) as follows:

(W_y^T − W_k^T) x + δ_{y,k} ≥ 1

Therefore, solving the multi-classification problem for the training set (x_i, y_i)_{i=1...n} is equivalent to finding W satisfying:

(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1   ∀i, k

We notice here that w = W_i − W_j is the normal vector to the separation boundary H_w = {z | w^T z + b_ij = 0} between the two classes ω_i and ω_j. Hence the width of the margin between two classes is, as in the binary case:

M(H_w) = 1/‖w‖
Maximizing the margin is equivalent to minimizing the norm ‖w‖. Indeed, we have ‖w‖^2 = ‖W_i − W_j‖^2 ≤ 2 (‖W_i‖^2 + ‖W_j‖^2). In order to maximize all the margins at the same time, it turns out that we have to minimize the L2 norm of the matrix W:

‖W‖_2^2 = Σ_{i=1}^m ‖W_i‖^2 = Σ_{i=1}^m Σ_{j=1}^d W_{ij}^2

Finally, we obtain the following optimization problem:

min_W  (1/2) ‖W‖^2
u.c.  (W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1   ∀i = 1...n, k = 1...m

The extension to the "soft-margin" case can be formulated easily by introducing slack variables ξ_i corresponding to each training point. As before, these slack variables allow a point to be classified inside the margin. The minimization problem now becomes:

min_{W,ξ}  (1/2) ‖W‖^2 + C F( Σ_{i=1}^n ξ_i^p )
u.c.  (W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1 − ξ_i,  ξ_i ≥ 0   ∀i, k

Remark 8 Within the ERM or VRM frameworks, we can construct the risk function via the loss function l(x, y) = I_{F(x) ≠ y} for a data pair (x, y). For example, in the ERM framework, we have:

R_emp(W) = (1/n) Σ_{i=1}^n I_{F(x_i) ≠ y_i}


The classification problem is now equivalent to finding the optimal matrix W⋆ which minimizes the empirical risk function. In the binary case, we have seen that the minimization of the risk function is equivalent to maximizing the margin, i.e. to minimizing ‖w‖^2 under linear constraints. We remark that in the VRM framework this problem can be tackled exactly as in the binary case. In order to prove the equivalence between minimizing the risk function and the large margin principle, we look for a linear upper bound of the indicator function I_{F(x) ≠ y}. As shown in Crammer K. and Singer Y. (2001), we consider the following function:

g(x, y; k) = (W_k^T − W_y^T) x + 1 − δ_{y,k}

In fact, we can prove that:

I_{F(x) ≠ y} ≤ g(x, y) = max_k g(x, y; k)   ∀(x, y)

We first remark that g(x, y; y) = (W_y^T − W_y^T) x + 1 − δ_{y,y} = 0, hence g(x, y) ≥ g(x, y; y) = 0. If the point (x_i, y_i) satisfies F(x_i) = y_i, then W_{y_i}^T x_i = max_k W_k^T x_i and I_{F(x) ≠ y}(x_i) = 0; in this case, it is obvious that I_{F(x) ≠ y}(x_i) ≤ g(x_i, y_i). If instead F(x_i) ≠ y_i, then W_{y_i}^T x_i < max_k W_k^T x_i and I_{F(x) ≠ y}(x_i) = 1; in this case, g(x_i, y_i) = max_k W_k^T x_i − W_{y_i}^T x_i + 1 ≥ 1. Hence, we again obtain I_{F(x) ≠ y}(x_i) ≤ g(x_i, y_i). Finally, we obtain the following upper bound of the risk function:

R_emp(W) ≤ (1/n) Σ_{i=1}^n max_k ( (W_k^T − W_{y_i}^T) x_i + 1 − δ_{y_i,k} )

If the data is separable, then the optimal value of the risk function is zero. If one requires that the upper bound of the risk function be zero, then the W⋆ which optimizes this bound must be the one optimizing R_emp(W). The minimization can be expressed as:

max_k ( (W_k^T − W_{y_i}^T) x_i + 1 − δ_{y_i,k} ) = 0   ∀i

or, in the same form as the large margin problem:

(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1   ∀i, k

Following the traditional route for solving this problem, we map it into the dual problem, as in the binary classification case. The details of this mapping are given in Crammer K. and Singer Y. (2001). We summarize here their important result in dual form, with dual variables η_i of dimension m, i = 1...n. Define τ_i = 1_{y_i} − η_i, where 1_{y_i} is the column vector of zeros except for a one at the y_i-th element. Then, in the case of the soft margin with p = 1 and F(u) = u, we have the dual problem:

max_τ  Q(τ) = − (1/2) Σ_{i,j} (x_i^T x_j) (τ_i^T τ_j) + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i}
u.c.  τ_i ≤ 1_{y_i} and τ_i^T 1 = 0   ∀i


We remark again that we obtain a quadratic program which involves only the inner products between all pairs of vectors x_i, x_j. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor x_i^T x_j by the kernel K(x_i, x_j):

max_τ  Q(τ) = − (1/2) Σ_{i,j} K(x_i, x_j) (τ_i^T τ_j) + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i}   (3.13)
u.c.  τ_i ≤ 1_{y_i} and τ_i^T 1 = 0   ∀i   (3.14)

The optimal solution of this problem allows us to evaluate the classification rule:

H(x) = argmax_{r=1...m} { Σ_{i=1}^n τ_{i,r} K(x, x_i) }   (3.15)

For a small number of classes m, we can implement the above optimization with a traditional QP program using a matrix of size mn × mn. However, for a large number of classes, we must employ an efficient algorithm, as even storing an mn × mn matrix is already a problem. Crammer and Singer introduced an interesting algorithm which improves this optimization both in storage and in computation speed.
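A minimal sketch of the recognition step (3.15) is given below; the n × m matrix tau of optimal dual variables and the kernel function are assumed to come from a separate training step.

import numpy as np

# Sketch of the multiclass decision rule (3.15):
#   H(x) = argmax_r  sum_i tau_{i,r} K(x, x_i)
def multiclass_predict(x, X_train, tau, kernel):
    k = np.array([kernel(x, xi) for xi in X_train])   # K(x, x_i), i = 1..n
    scores = k @ np.asarray(tau)                       # one score per class r
    return int(np.argmax(scores)) + 1                  # classes labeled 1..m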

3.5 SVM-regression in finance


Recently, different applications of SVM in the financial field have been developed along two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or the volatility. In this context, SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique which aims to elaborate the stock selection in a trading strategy (for example a long/short strategy). The SVM regression can be considered either as a non-linear filter for time series or as a regression for evaluating a score. We first discuss here how to employ the SVM regression as an estimator of the trend of a given asset. The estimated trend can later be used in momentum strategies such as the trend-following strategy. We next use SVM as a method for constructing stock scores for a long/short strategy.

3.5.1 Numerical tests on SVM-regressors


We test here the efficiency of the different regressors discussed above. They can be distinguished by the form of the loss function (L1-type or L2-type) or by the form of the non-linear kernel. We do not focus yet on the calibration of the SVM parameters and reserve it for the next discussion on the trend extraction of financial time series, with a full description of the cross-validation procedure. For a given time series y_t, we would like to regress the data on the training vector x = t = (t_i)_{i=1...n}. Let us consider


two models of time series. The first model is simply a deterministic trend perturbed by a white noise:

y_t = (t − a)^3 + σ N(0, 1)   (3.16)

The second model for our tests is the Black-Scholes model of the stock price:

dS_t / S_t = µ_t dt + σ_t dB_t   (3.17)

In this case, the studied signal is y_t = ln S_t. The parameters of the model are the annualized return µ = 5% and the annualized volatility σ = 20%. We consider the regression on a period of one year, corresponding to N = 260 trading days.

The first test consists of comparing the L1-regressor and the L2-regressor with a Gaussian kernel (see Figures 3.3-3.4). As shown in Figures 3.3 and 3.4, the L2-regressor seems to be better suited for the regression. Indeed, over many tests on simulated data from model (3.17), we observe that the L2-regressor is more stable than the L1-regressor (i.e. L1 is more sensitive to the training dataset). In the second test, we compare different L2 regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid (see Figures 3.5 and 3.6).

Figure 3.3: L1-regressor versus L2-regressor with Gaussian kernel for model (3.16) (real signal, L1 regression and L2 regression of y_t against t)

3.5.2 SVM-Filtering for forecasting the trend of signal


Here, we employ SVM as a non-linear filtering technique for extracting the hidden
trend of a time series signal. The regression principle was explained in the last


Figure 3.4: L1-regressor versus L2-regressor with Gaussian kernel for model (3.17) (real signal, L1 regression and L2 regression of ln(S_t/S_0) against t)

Figure 3.5: Comparison of different regression kernels for model (3.16) (real signal with linear, polynomial, Gaussian and sigmoid kernels)


Figure 3.6: Comparison of different regression kernels for model (3.17) (real signal with linear, polynomial, Gaussian and sigmoid kernels)

discussion. We now apply this technique to estimate the trend µ̂_t (the derivative of the filtered signal), which is then plugged into a trend-following strategy.

Description of the trend-following strategy

We choose here the simplest trend-following strategy, whose exposure is given by:

e_t = m µ̂_t / σ̂_t^2

with m the risk tolerance and σ̂_t the volatility estimator given by:

σ̂_t^2 = (1/T) ∫_0^T σ_t^2 dt = (1/T) Σ_{i=t−T+1}^t ln^2( S_i / S_{i−1} )

In order to limit the risk of explosion of the exposure e_t, we cap it between a lower bound e_min and an upper bound e_max:

e_t = max( min( m µ̂_t / σ̂_t^2, e_max ), e_min )

The wealth of the portfolio is then given by the following expression:

W_{t+1} = W_t + W_t [ e_t (S_{t+1}/S_t − 1) + (1 − e_t) r_t ]


SVM-Filtering
We now discuss how to build a cross-validation procedure which can learn the trend of a given signal. We use the moving average as a benchmark against which to compare this new filter. An important parameter in moving-average filtering is the estimation horizon T, so we use this horizon as a reference to calibrate our SVM filtering. For the sake of simplicity, we study here only the SVM filter with Gaussian kernel and L2 penalty. The two typical parameters of the SVM filter are C and σ: C is the parameter which allows a certain level of error in the regression curve, while σ characterizes the estimation horizon and is directly proportional to T. We propose two schemes for the validation procedure, based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter σ = T and optimize the error tolerance parameter C on the validation set; this scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing both parameters (C, σ) on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret as σ is now a dynamic parameter. However, by relating σ to the local horizon, we can gain additional insight into the changes in the price of the underlying asset. For example, we can determine in the historical data whether the underlying asset undergoes a period of long or short trends, which can help to recognize additional signatures such as the cycle between long and short trends. We report the two schemes in the following algorithm (Algorithm 3); a minimal code sketch of the first scheme is given after Figure 3.7.

Figure 3.7: Cross-validation procedure for determining the optimal values C⋆ and σ⋆ (the historical data up to today are divided into a training window and a validation window, delimited by T_1 and T_2, followed by the forecasting window)
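Below is the minimal sketch of the first (fixed-horizon) scheme announced above; `fit` and `predict` stand for any SVM regressor with Gaussian kernel (for example an LS-SVM), and the grid of C values is an illustrative assumption.

import numpy as np

# Sketch of the fixed-horizon validation scheme: sigma is tied to the
# horizon T and the error tolerance C is chosen on the validation window.
def svm_filter_fixed_horizon(t, y, T, fit, predict, C_grid=(0.1, 1.0, 10.0)):
    t, y = np.asarray(t, float), np.asarray(y, float)
    n = len(y)
    train, valid = slice(0, n - 2 * T), slice(n - 2 * T, n - T)
    best_C, best_err = None, np.inf
    for C in C_grid:
        model = fit(t[train], y[train], sigma=T, C=C)
        err = np.mean((predict(model, t[valid]) - y[valid]) ** 2)
        if err < best_err:
            best_C, best_err = C, err
    model = fit(t[:n - T], y[:n - T], sigma=T, C=best_C)
    return predict(model, t[n - T:])      # trend forecast on the testing window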

Backtesting
We first check the SVM filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return µ = 10% and annualized volatility σ = 20%. The regression is based on one trading year of data (n = 260 days) with a fixed horizon of one month (T = 20 days). In Figure 3.8, we present the result of the SVM trend prediction with fixed horizon T = 20, whereas Figure 3.9 presents the SVM trend prediction for the second scheme.

3.5.3 SVM for multivariate regression

As a regression method, we can also employ SVM for multivariate regression. Assume that we consider a universe of d stocks X = (X^(1), ..., X^(d)) over a


Figure 3.8: SVM-filtering with the fixed horizon scheme (real signal y_t with training, validation and prediction parts)

Figure 3.9: SVM-filtering with the dynamic horizon scheme (real signal y_t with training, validation and prediction parts)


Algorithm 3 SVM score construction
procedure SVM_Filter(X, y, T)
    Divide the data into a training set D_train, a validation set D_valid and a testing set D_test
    Perform the SVM regression on the training data D_train
    Construct the SVM prediction on the validation set D_valid
    if fixed horizon then
        σ ← T
        Compute the prediction error Error(C) on D_valid
        Minimize Error(C) and obtain the optimal parameter C⋆
    else
        Compute the prediction error Error(σ, C) on D_valid
        Minimize Error(σ, C) and obtain the optimal parameters (σ⋆, C⋆)
    end if
    Use the optimal parameters to predict the trend on the testing set D_test
end procedure

period of n dates. The performance of the index or of an individual stock that we are interested in is denoted by y. We look for a prediction of the value y_{n+1} by regressing on the historical data (X_t, y_t)_{t=1...n}. In this case, the different stocks play the role of the factors of the vectors in the training set. We can also apply other regressions, such as predicting the performance of a stock from the available information on all the factors.

Multivariate regression
We first test the efficiency of the multivariate regression on a simulated model. Assume that all the factors follow a Brownian motion:

dX_t^(i) = µ_t dt + σ_t dB_t^(i)   ∀i = 1...d

Let (y_t)_{t=1...n} be the vector to be regressed, which is related to the input X by the function:

y_t = f(X_t) = W_t^T X_t

We would like to regress the vector y = (y_t)_{t=2...n} on the historical data (X_t)_{t=1...n−1} by SVM regression. This regression is given by the function y_t = F(X_{t−1}). Hence, the prediction of the future performance y_{n+1} is given by:

E[y_{n+1} | X_n] = F(X_n)

In Figure 3.10, we present the results obtained with the Gaussian kernel under the L1 and L2 penalty conditions, whereas in Figure 3.11 we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we notice this lag in the prediction of the value of y.
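A minimal sketch of this multivariate experiment is given below; the data-generating parameters, the lag-one design and the use of scikit-learn's SVR are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Simulate d independent Brownian factors and a linear target y_t = W^T X_t
rng = np.random.default_rng(1)
n, d, mu, sigma = 500, 5, 0.05 / 260, 0.20 / np.sqrt(260)
X = np.cumsum(mu + sigma * rng.standard_normal((n, d)), axis=0)
W = rng.standard_normal(d)
y = X @ W

# Lag-one regression: predict y_t from X_{t-1}
X_lag, y_target = X[:-1], y[1:]
svr = SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=1e-3)
svr.fit(X_lag[:-1], y_target[:-1])          # keep the last point out of sample

# One-step-ahead prediction E[y_{n+1} | X_n] = F(X_n)
y_hat = svr.predict(X_lag[-1:])
print("predicted:", y_hat[0], "realized:", y_target[-1])
```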


Figure 3.10: L1-regressor versus L2-regressor with Gaussian kernel for model (3.16)
[Plot of y_t against t, comparing the real signal with the L1 and L2 regressions.]

Figure 3.11: Comparison of different kernels for multivariate regression
[Plot of y_t against t, comparing the real signal with the linear, polynomial, Gaussian and sigmoid kernels.]


Backtesting

3.6 SVM-classification in finance


We now discuss in this section the second application of SVM in finance, namely as a stock classifier. We first test our implementations of the binary classifier and of the multi-classifier. We then employ the SVM technique to study two different problems: (i) the recognition of sectors and (ii) the construction of an SVM score for a stock-picking strategy.

3.6.1 Test of SVM-classifiers


For the binary classification problem, we consider both approaches (dual and primal) to determine the boundary between two given classes based on the available information of each data point. For the multi-classification problem, we first extend the binary classifier to the multi-class case by using the binary decision tree (SVM-BDT). This algorithm has been shown to be more efficient than traditional approaches such as "one-against-all" or "one-against-one", both in computation time and in precision. The general multi-SVM approach will then be compared to SVM-BDT.

Binary-SVM classifier
Let us compare here the two proposed approaches (dual and primal) for solving numerically the SVM-classification problem. In order to carry out the test, we consider a random training data set of n vectors x_i with classification criterion y_i = sign(x_i). We present here the comparison of the two classification approaches with a linear kernel. The result of the primal approach is obtained directly with the software of O. Chapelle (see footnote 2), which implements the L2 penalty condition. Our dual solver is implemented for both the L1 and L2 penalty conditions simply by employing a QP program. In Figure 3.12, we show the classification results obtained by both methods with the L2 penalty condition.
We next test the non-linear classification by using the Gaussian kernel (RBF kernel) with the binary dual solver. We generate the simulated data in the same way as in the last example, with x ∈ R². The result of the classification is illustrated in Figure 3.13 for the RBF kernel with parameters C = 0.5 and σ = 2 (see footnote 3).
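A compact sketch of this test with scikit-learn (an illustrative stand-in for the dedicated dual QP solver and for O. Chapelle's primal code; the reading of y_i = sign(x_i) as the sign of the first coordinate in the 2-D case is an assumption) is:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Linear case: 1-D points labeled by their sign
x = rng.standard_normal((100, 1))
y = np.sign(x).ravel()
linear_clf = SVC(kernel="linear", C=1.0).fit(x, y)

# Non-linear case: 2-D points, labels given by the sign of the first coordinate
X2 = rng.standard_normal((200, 2))
y2 = np.sign(X2[:, 0])
# RBF kernel with C = 0.5 and width sigma = 2, i.e. gamma = 1 / (2 sigma^2)
rbf_clf = SVC(kernel="rbf", C=0.5, gamma=1.0 / (2 * 2.0 ** 2)).fit(X2, y2)

print("linear training accuracy:", linear_clf.score(x, y))
print("RBF training accuracy:", rbf_clf.score(X2, y2))
```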

Multi-SVM classifier
We first test the implementation of SVM-BDT on simulated data (x_i)_{i=1...n} which are generated randomly. We suppose that these data are distributed into N_c classes. In order to test our multi-SVM implementation efficiently, the response vector y =
Footnote 2: The free software of O. Chapelle can be found at http://olivier.chapelle.cc/primal/
Footnote 3: We used here the "plotlssvm" function of the LS-SVM toolbox for the graphical illustration. A similar result was also obtained by using the "trainlssvm" function of the same toolbox.


Figure 3.12: Comparison between Dual algorithm and Primal algorithm
[Plot of h(x, y) over the training data, showing the primal and dual solutions, the decision boundary and the margins.]

Figure 3.13: Illustration of non-linear classification with Gaussian kernel
[Scatter plot of the two classes in the (X1, X2) plane with the non-linear decision boundary.]


(y1, . . . , yn) is supposed to depend only on the first coordinate of the data vector:

z ∼ U(0, 1)
x1 = Nc z + ε N(0, 1)
y = [Nc z]
xi ∼ U(0, 1)   ∀ i > 1

Here [a] denotes the integer part of a. We could generate the simulated data in a much more general way, but it would then be very hard to visualize the result of the classification. With the above choice of simulated data, we can see that in the case ε = 0 the data are separable along the axis x1. In the geometric view, the space R^d is divided into Nc zones along the axis x1: R^{d−1} × [0, 1[, . . . , R^{d−1} × [Nc − 1, Nc[. The boundaries are simply the Nc − 1 hyperplanes R^{d−1} crossing x1 = 1, . . . , Nc − 1. When we introduce some noise on the coordinate x1 (ε > 0), the training set is no longer separable by this ensemble of linear hyperplanes. There will be some misclassified points and some deformation of the boundaries, thanks to the non-linear kernel. For the sake of simplicity, we assume that the data (x, y) are already gathered by group. In Figures 3.14 and 3.15, we present the classification results for in-sample data and out-of-sample data in the case ε = 0 (i.e. separable data).
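The following sketch generates this simulated data set and classifies it with a multi-class SVM; scikit-learn's one-against-one SVC is used here as a stand-in for the SVM-BDT implementation, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def make_classes(n=1000, d=5, n_classes=10, eps=0.0, seed=3):
    """Simulated data: the class label depends only on the first coordinate."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, n)
    X = rng.uniform(0.0, 1.0, (n, d))
    X[:, 0] = n_classes * z + eps * rng.standard_normal(n)   # noisy first coordinate
    y = np.floor(n_classes * z).astype(int)                  # integer part defines the class
    return X, y

# Separable case (eps = 0) and noisy case (eps = 0.2)
for eps in (0.0, 0.2):
    X, y = make_classes(eps=eps)
    X_test, y_test = make_classes(eps=eps, seed=4)
    clf = SVC(kernel="rbf", C=20.0, gamma=2.0).fit(X, y)     # one-against-one by default
    print(f"eps={eps}: in-sample {clf.score(X, y):.2f}, out-of-sample {clf.score(X_test, y_test):.2f}")
```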

Figure 3.14: Illustration of multiclassification with SVM-BDT for in-sample data
[Bar chart of the assigned class C01-C10 for each in-sample stock, comparing the real class distribution with the multiclass SVM output.]

We now introduce the noise on the data coordinate x1 with ε = 0.2 (see Figure 3.17; Figure 3.16 shows the case ε = 0 for comparison).


Figure 3.15: Illustration of multiclassification with SVM-BDT for out-of-sample data
[Bar chart of the assigned class C01-C10 for each out-of-sample stock, comparing the real class distribution with the multiclass SVM output.]

Figure 3.16: Illustration of multiclassification with SVM-BDT for ε = 0
[Scatter plot of the data in the (x1, x2) plane, with the ten classes C1-C10 separated along x1.]


Figure 3.17: Illustration of multiclassification with SVM-BDT for ε = 0.2
[Scatter plot of the noisy data in the (x1, x2) plane, with the ten classes C1-C10 overlapping along x1.]

3.6.2 SVM for classification


We employ here the multi-SVM algorithm on the constituents of the Eurostoxx 300 index. Our goal is to determine the boundaries between the various sectors to which the constituents of the index belong. As the algorithm contains two main parts, classification and prediction, we can classify our stocks via their common properties resulting from the available factors. The number of misclassified stocks, or the classification error, gives us an assessment of the sector definition. We next study the recognition phase on the ensemble of test data.

Classification of stocks by sectors


In order to classify the stocks composing the Eurostoxx 300 index, we consider the Ntrain = 100 most representative stocks in terms of value. In order to establish the multi-class SVM classification using the binary decision tree, we sort the Ntrain = 100 assets by sector. We then employ the SVM-BDT to compute the Ntrain − 1 binary separators. In Figure 3.18, we present the classification result with the Gaussian kernel and the L2 penalty condition. For σ = 2 and C = 20, we are able to classify correctly the 100 assets over the ten main sectors: Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities, Consumer Services.
In order to check the efficiency of the classification, we test the prediction quality on a test set composed of Ntest = 50 assets. In Figure 3.19, we compare the SVM-BDT result with the true sector distribution of the 50 assets.


Figure 3.18: Multiclassification with SVM-BDT on training set
[Bar chart of the assigned sector for each of the 100 training stocks, comparing the real sector distribution with the multiclass SVM output over the ten sectors listed above.]

In this case, the rate of correct prediction is about 58%.

Calibration procedure

As discussed above in the implementation part of the SVM solver, there are two kinds of parameters which play an important role in the classification process. The first parameter, C, controls the error tolerance of the margin, and the second concerns the choice of kernel (for example σ for the Gaussian kernel). In the last example, we optimized the couple of parameters (C, σ) in order to obtain the best classifiers which do not commit any error on the training set. However, this result is valid only if the sectors are correctly defined. Here, nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determination of the binary SVM classifiers on the training data set and (ii) calibration of the parameters on the validation set. In fact, we decide to optimize the couple of parameters (C, σ) by minimizing the realized error on the validation set, because the error committed on the training set (learning set) must always be smaller than the one on the validation set (unknown set). In the second phase, we can redefine the sectors in the sense that if an asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain first a new recognition of the sectors and second a multi-classifier for new assets.
As SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set. We propose here to optimize the


Figure 3.19: Prediction efficiency with SVM-BDT on the validation set
[Bar chart of the assigned sector for each of the 50 validation stocks, comparing the real sector distribution with the multiclass SVM output.]

SVM parameters by minimizing the error on the validation set. We use the same error function defined in Section 3, but apply it to the validation data set V:

Error = (1 / card(V)) Σ_{i∈V} ψ(−y'_i f(x'_i))

where ψ(x) = I_{x>0}, with I_A the standard notation for the indicator function. However, the error function involves the step function ψ, which is discontinuous and can cause difficulties if we want to determine the best parameters via the optimal test error. In order to search for the minimal test error, by gradient descent for example, we should smooth the test error by regularizing the step function as:

ψ̃(x) = 1 / (1 + exp(−Ax + B))

The choice of the parameters A and B is important. If A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
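As an illustration, a minimal sketch of this smoothed validation error is given below; the sigmoid parameters A and B, the toy data and the grid search over (C, σ) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def smoothed_error(decision_values, y, A=10.0, B=0.0):
    """Smoothed validation error: sigmoid approximation of the step function psi applied to -y_i f(x_i)."""
    margins = -y * decision_values
    return np.mean(1.0 / (1.0 + np.exp(-A * margins + B)))

# Toy data: two Gaussian clouds for training and validation
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
X_val = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y_val = np.concatenate([-np.ones(50), np.ones(50)])

# Grid search over (C, sigma) using the smoothed validation error
best = None
for C in (0.1, 1.0, 10.0):
    for sigma in (0.5, 1.0, 2.0):
        clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
        err = smoothed_error(clf.decision_function(X_val), y_val)
        if best is None or err < best[0]:
            best = (err, C, sigma)
print("best (error, C, sigma):", best)
```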

Recognition of sectors
By construction, the SVM classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able to recognize the sectors themselves, or to introduce a new, more correct definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is more related to the origin of the stock than to its intrinsic behavior in the market. This may create problems for a trading strategy if a stock is misclassified, for example in the case of a pair-trading strategy. Here, we try to overcome this weak point of SVM and introduce a method which modifies the initial definition of the sectors.
The main idea of the sector-recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start with the initial definition of the given sectors. Within each iteration, we learn on the training set in order to determine the classifiers, then we evaluate the validation error. An optimization procedure on the validation error helps us to determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may observe some errors on the training set. If the validation error is smaller than a certain threshold and no error is committed on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both the training and validation sets. The iteration procedure is repeated until no error is committed on the training set for a given expected threshold of error on the validation set. The algorithm of this sector-recognition procedure is summarized below:

Algorithm 4 Sector recognition by SVM classification

procedure SVM_SectorRecognition(X, y, ε)
    Divide the historical data into a training set T and a validation set V
    Initialize the sector labels with the physical sector names: Sec_1^0, . . . , Sec_m^0
    while E^T > ε do
        while E^V > ε do
            Compute the SVM separators for the labels Sec_1, . . . , Sec_m on T for given (C, σ)
            Construct the SVM predictor from the separators Sec_1, . . . , Sec_m
            Compute the error E^V on the validation set V
            Update the parameters (C, σ) until convergence of E^V
        end while
        Compute the error E^T on the training set T
        Identify the misclassified points of the training set
        Relabel the misclassified points and update the definition of the sectors
    end while
end procedure
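A schematic Python implementation of this loop is sketched below. The use of an RBF SVC as a stand-in for the SVM-BDT classifier, the parameter grid and the threshold are assumptions for illustration; in the full procedure the validation labels are relabeled with the same rule as the training labels.

```python
import numpy as np
from sklearn.svm import SVC

def recognize_sectors(X_train, y_train, X_valid, y_valid, eps=0.05, max_iter=20):
    """Iteratively relabel misclassified training stocks until no training error remains."""
    labels = y_train.copy()
    clf = None
    for _ in range(max_iter):
        # Inner loop: choose (C, sigma) by minimizing the validation error
        best = None
        for C in (1.0, 10.0, 100.0):
            for sigma in (0.5, 1.0, 2.0):
                cand = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X_train, labels)
                err_valid = np.mean(cand.predict(X_valid) != y_valid)
                if best is None or err_valid < best[0]:
                    best = (err_valid, cand)
        err_valid, clf = best
        # Outer loop: check the training error and relabel the misclassified points
        pred_train = clf.predict(X_train)
        err_train = np.mean(pred_train != labels)
        if err_train == 0 and err_valid <= eps:
            break
        labels = pred_train       # misclassified stocks receive the predicted sector label
    return labels, clf
```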

3.6.3 SVM for score construction and stock selection


Traditionally, in order to improve stock picking, we rank the stocks by constructing a "score" based on all the characterizations (so-called factors) of the considered stock. We require that the construction of this global quantity (a combination of factors) satisfies some classification criterion, for example the performance. We denote by (x_i)_{i=1...n}, with x_i the ensemble of factors of the i-th stock, and by y = (y_i)_{i=1...n} the classification criterion, such as the performance. The aim of the SVM classifier in this problem is to recognize which stocks (scores) belong to the high or low performance class (outperforming or underperforming). More precisely, we have to identify a separation boundary as a function of score and performance, f(x, y). Hence, SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all the characterizations of a given stock, such as the price, the risk, macro-properties, etc., into comparable quantities); (ii) application of the SVM-classification algorithm with an adaptive choice of parameters. In the following, we first give a brief description of the score construction and then establish the backtest of the stock-picking strategy.

Probit model for score construction


We briefly summarize here the main idea of the score construction by the Probit model. Assume that a set of training data (x_i, y_i)_{i=1...n} is available, where x is the vector of factors and y is the binary response. We seek to construct the conditional probability distribution of the random variable Y for a given point X. This probability distribution can later be used to predict the response of a new data point x_new. The Probit model estimates this conditional probability in the form:

Pr(Y = 1 | X) = Φ(X^T β + α)

with Φ(x) the cumulative distribution function (CDF) of the standard normal distribution. The couple of parameters (α, β) can be obtained by maximum likelihood estimation. The choice of the function Φ(x) is quite natural for a binary random variable because it yields a symmetric probability distribution.
Remark 9 We note that this model can be written in another form with the introduction of a hidden random variable:

Y* = X^T β + α + ε

where ε ∼ N(0, 1). Hence, Y can be interpreted as an indicator of whether Y* is positive:

Y = I_{Y* > 0} = 1 if Y* > 0, and 0 otherwise

In finance, we can employ this model for the score construction. We define the binary variable Y from the relative return of a given asset with respect to the benchmark: Y = 1 if the return of the asset is higher than that of the benchmark and Y = 0 otherwise. Hence, Pr(Y = 1 | X) is the probability for a given asset with factor vector X to outperform. Naturally, we can define this quantity as a score measuring the probability of gain over the benchmark:

S = Pr(Y = 1 | X)


In order to estimate the regression parameters α and β, we maximize the log-likelihood function:

L(α, β) = Σ_{i=1}^{n} [ y_i ln Φ(x_i^T β + α) + (1 − y_i) ln(1 − Φ(x_i^T β + α)) ]

Using the parameters estimated by maximum likelihood, we can predict the score of a given asset with factor vector X as follows:

Ŝ = Φ(X^T β̂ + α̂)

The probability distribution of the score Ŝ can be computed by the empirical formula:

Pr(Ŝ < s) = (1/n) Σ_{i=1}^{n} I_{S_i < s}

However, if we compute the probability density function (PDF), we obtain a sum of Dirac functions. In order to obtain a smooth distribution, we convolve this density with a Gaussian kernel; the PDF then reads:

p_S(s) = (1/n) Σ_{i=1}^{n} (1 / (√(2π) σ)) exp(−(s − s_i)² / (2σ²))

with σ a smoothing parameter.


In order to test the numerical implementation, we employ the Probit model on simulated data generated in the same way as the hidden-variable model discussed in the remark. Let (x_1, . . . , x_n) ∼ N(0, σ²) be the data set with d factors (i.e. x_i ∈ R^d). For all simulations, we took σ = 0.1. The binary response is given by the following model:

Y_0 = X^T β_0 + α_0 + N(0, 1)
Y = I_{Y_0 > 0}

Here, the parameters of the model are chosen as α_0 = 0.1 and β_0 = 1. We employ the Probit regression in order to determine the score of n = 500 data points in the cases d = 2 and d = 5. The comparisons between the Probit score and the simulated score are presented in Figures 3.20-3.22.
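A minimal sketch of this test is given below; the use of statsmodels' Probit estimator and of scipy for the normal CDF are implementation assumptions.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulate the hidden-variable model Y0 = X beta0 + alpha0 + N(0,1), Y = 1{Y0 > 0}
rng = np.random.default_rng(6)
n, d, alpha0 = 500, 2, 0.1
beta0 = np.ones(d)
X = rng.normal(0.0, 0.1, (n, d))
Y = (X @ beta0 + alpha0 + rng.standard_normal(n) > 0).astype(int)

# Probit regression by maximum likelihood and predicted score S_hat = Phi(X beta_hat + alpha_hat)
fit = sm.Probit(Y, sm.add_constant(X)).fit(disp=0)
score_probit = fit.predict(sm.add_constant(X))

# Score implied by the true parameters, for comparison with the Probit score
score_true = norm.cdf(X @ beta0 + alpha0)
print("mean absolute difference between the scores:", np.abs(score_probit - score_true).mean())
```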

SVM score construction


We discuss now how to employ SVM to construct a score for a given ensemble of assets. In the work of G. Simon (2005), the SVM score is constructed by using the SVM-regression algorithm. In fact, with the SVM-regression algorithm we are able to forecast the future performance E[µ_{t+1} | X_t] = µ̂_t based on the present ensemble of factors; this value can then be employed directly as the prediction in a trend-following strategy, without the need for a score construction. We propose here another utilization

Figure 3.20: Comparison between simulated score and Probit score for d = 2
[Plot of the score of each of the 500 assets, simulated score versus Probit score.]

Figure 3.21: Comparison between simulated score CDF and Probit score CDF for d = 2
[Plot of the two cumulative distribution functions of the score.]


Figure 3.22: Comparison between simulated score PDF and Probit score PDF for d = 2
[Plot of the two smoothed probability density functions of the score.]

of the SVM algorithm, based on SVM-classification, for building scores which later allow us to implement long/short strategies by using selection curves. The main idea of our SVM-score construction is very similar to the Probit model. We first define a binary variable Y_i = ±1 associated with each asset x_i. This variable characterizes the performance of the asset with respect to the benchmark: if Y_i = −1 the stock underperforms, whereas if Y_i = 1 the stock outperforms. We next employ the binary SVM-classification to separate the universe of stocks into two classes: high performance and low performance. Finally, we define the score of each stock as its distance to the decision boundary.
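With a generic SVM library, such a score can be sketched as the signed distance to the decision boundary; the factor matrix, the benchmark-relative labels and the kernel parameters below are hypothetical and serve only to illustrate the construction.

```python
import numpy as np
from sklearn.svm import SVC

def svm_score(factors, excess_returns, C=10.0, sigma=1.0):
    """Score each stock by its (signed) distance to the high/low performance boundary."""
    y = np.where(excess_returns > 0, 1, -1)          # +1: outperforms the benchmark, -1: underperforms
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma ** 2)).fit(factors, y)
    raw = clf.decision_function(factors)              # proportional to the distance to the boundary
    return (raw - raw.min()) / (raw.max() - raw.min())  # rescaled to [0, 1], comparable to a Probit score

# Hypothetical usage with simulated factors and excess returns
rng = np.random.default_rng(7)
factors = rng.standard_normal((300, 20))
excess = 0.02 * factors[:, 0] + 0.01 * rng.standard_normal(300)
scores = svm_score(factors, excess)
```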

Selection curve

In order to construct a simple strategy, of the long/short type for example, we must be able to establish a selection rule based on the scores obtained by the Probit model or by the SVM. Depending on the strategy (long, short or long/short), we want to build a selection curve which determines the proportion of assets selected for a certain level of error. For a long strategy, we prefer to buy a certain proportion of high-performance stocks with knowledge of the possible committed error. To do so, we define a


selection curve for which the score plays the role of the parameter:

Q(s) = Pr(S ≥ s)
E(s) = Pr(S ≥ s | Y = 0)     ∀ s ∈ [0, 1]

This parametric curve can be traced in the square [0, 1] × [0, 1], as shown in Figure 3.23. On the x-axis, Q(s) defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, E(s) defines the committed error corresponding to the stock selection; precisely, for a certain quantile, it measures the chance of picking a low-performance stock. Two trivial limits are the points (0, 0) and (1, 1): the first corresponds to selecting no stocks, whereas the second corresponds to selecting all of them. A good score construction method should produce a selection curve that is as convex as possible, since this guarantees a selection with fewer errors.
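The selection curve is straightforward to compute from a vector of scores and the realized binary outcomes; a minimal sketch with hypothetical inputs is:

```python
import numpy as np

def selection_curve(scores, y):
    """Long-strategy selection curve: Q(s) = Pr(S >= s) and E(s) = Pr(S >= s | Y = 0)."""
    thresholds = np.sort(np.unique(scores))
    Q = np.array([np.mean(scores >= s) for s in thresholds])
    bad = scores[y == 0]                               # scores of the low-performance stocks
    E = np.array([np.mean(bad >= s) for s in thresholds])
    return Q, E

# Example: an informative score should give a curve below the diagonal
rng = np.random.default_rng(8)
scores = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < scores).astype(int)      # higher score => more likely to outperform
Q, E = selection_curve(scores, y)
```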

Figure 3.23: Selection curve for long strategy for simulated data and Probit model
[Parametric curve of Pr(S > s | Y = 0) against Pr(S > s) in the unit square.]

Reciprocally, for a short strategy, the selection curve can be obtained by tracing the following parametric curve:

Q(s) = Pr(S ≤ s)
E(s) = Pr(S ≤ s | Y = 1)     ∀ s ∈ [0, 1]

Here, Q(s) gives us the quantile of low-performance stocks to be shorted, while E(s) helps us to avoid selling the high-performance ones. As the selection


Figure 3.24: Probit scores for Eurostoxx data with d = 20 factors
[Selection curves Pr(S > s | Y = 0) against Pr(S > s) for the Probit score on the training and validation sets.]

curve is independent of the score definition, it is an appropriate quantity for comparing different scoring techniques. In the following, we employ the selection curve to compare the score constructions of the Probit model and of the SVM. Figure 3.24 shows the comparison of the selection curves constructed by the SVM score and the Probit score on the training set. Here, we did not perform any calibration of the SVM parameters.

Backtesting and comparison

As presented in the last discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained from the training set, whereas its optimal parameters (C, σ) are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we decide to divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set incorporates more information from the recent past.
We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 3.25 presents the construction of the score based on the training set and the validation set. The SVM parameters are optimized on


Algorithm 5 SVM score construction

procedure SVM_Score(X, y)
    Divide the data into a training set Dtrain, a validation set Dvalid and a testing set Dtest
    Classify the training data by using the high/low performance criterion
    Compute the decision boundary on Dtrain
    Construct the SVM score on Dvalid by using the distance to the decision boundary
    Compute the prediction error and classification error Error(σ, C) on Dvalid
    Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
    Use the optimal parameters to compute the final SVM score on the testing set Dtest
end procedure

the validation set, while the final score construction uses both the training and validation sets in order to have the largest possible data ensemble.

Figure 3.25: SVM scores for Eurostoxx data with d = 20 factors
[Selection curves Pr(S > s | Y = 0) against Pr(S > s) for the SVM score on the training, validation and testing sets.]

3.7 Conclusion
Support vector machine is a well-established method with a very wide use in various domains. From the financial point of view, this method can be used to recognize and to predict the high-performance stocks. Hence, SVM is a good indicator for building efficient trading strategies over a universe of stocks. Within this paper, we first revisited the basic idea of SVM in both the classification and regression contexts. The extension to the multi-classification case was also discussed in detail, as were various applications of this technique. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first one consists of using SVM as a signal filter; the advantage of the method is that we can calibrate the model parameters by using only the available data. The second application employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this paper are the score construction and the sector recognition. Both types of resulting information are important for building momentum strategies, which are at the core of modern asset management.

Bibliography

[1] Allwein E.L. et al. (2000), Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, 1, pp. 113-141.

[2] At A. (2005), Optimisation d'un Score de Stock Screening, Rapport de stage ENSAE, Société Générale Asset Management.

[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression, Neural Information Processing, 11, pp. 203-224.

[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.

[5] Burges C.J.C. (1998), A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167.

[6] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, Paris 6.

[7] Chapelle O. et al. (2002), Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 46, pp. 131-159.

[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Neural Computation, 19, pp. 1155-1178.

[9] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20, pp. 273-297.

[10] Crammer K. and Singer Y. (2001), On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2, pp. 265-292.

[11] Gestel T.V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.

[12] Madzarov G. et al. (2009), A Multi-class SVM Classifier Utilizing Binary Decision Tree, Informatica, 33, pp. 233-241.

[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.

[14] Potluru V.K. et al. (2009), Efficient Multiplicative Updates for Support Vector Machines, Proceedings of the 2009 SIAM Conference on Data Mining.

[15] Simon G. (2005), L'Econométrie Non Linéaire en Gestion Alternative, Rapport de stage ENSAE, Société Générale Asset Management.

[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.

[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.

[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.

Chapter 4

Analysis of Trading Impact in the CTA strategy

We review in this chapter the trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We first study the momentum strategy in the single-asset case, then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be shown that the cumulated return of the strategy can be broken down into two important parts: the option profile, which is similar in concept to the straddle profile suggested by Fung and Hsieh (2001), and the trading impact, which directly relates the estimation error to the efficiency of the strategy. We focus in this paper on the second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund.

Keywords: CTA, Momentum strategy, Trend following, Kalman filter, Trading impact, Chi-square distribution.

4.1 Introduction
Trend-following strategies are a specific example of an investment style that has recently emerged as an industry. They are operated by so-called Commodity Trading Advisors (CTAs) and play an important role in the hedge fund industry (15% of total Hedge Fund AUM). Recently, this investment style has been carefully reviewed and analyzed in the 7th Lyxor White Paper. We present here a complementary result to this paper and give a more specific analysis of a typical CTA. We focus on the trading impact by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund.

This chapter is organized as follows. In the first part, we recall the main result of the trend-following strategy in the univariate case, which was demonstrated in the 7th Lyxor White Paper. We then generalize this result to the multivariate case, which establishes a framework for studying the impact of the correlation and of the number of assets in a CTA fund. Finally, we finish with the study of a toy model which helps to understand the efficiency of the trend-following strategy.

4.2 Conclusion
Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this paper, we studied the impact of the estimation error on a trend-following strategy, both in the single-asset and in the multi-asset case. The objective of this paper is twofold. First, we established a general framework for analyzing a CTA fund. Second, we illustrated important results on the trading impact of a CTA strategy via a simple "toy model". We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) within a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. This implies that, above this limit, adding more assets does not improve the performance very much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well. As usual, the higher the correlation level, the less efficient the strategies are. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gain is much higher than the conditional expectation of loss.

Conclusions

During my internship in the R&D team of Lyxor Asset Management, I had the chance to work on many interesting topics concerning quantitative asset management. Beyond this report, the results obtained during the stay have been used for the 8th edition of the Lyxor White Paper series. The main results of this internship can be divided into three main lines. The first consists of improving the trend and volatility estimations, which are important quantities for implementing dynamic strategies. The second concerns the application of machine learning technology in finance: we employ the support vector machine for forecasting the expected return of financial assets and for obtaining a criterion for stock selection. The third is devoted to the analysis of the performance of the trend-following strategy (CTA) in the general case; it consists of studying the efficiency of CTAs with respect to changes in the market, such as the correlation between the assets or their performance.

In the first part, we focused on improving the trend and volatility estimations in order to implement two crucial momentum strategies: trend-following and voltarget. We show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed. We remark that these models can reflect the effect of mean reversion toward the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter. On the other hand, voltarget strategies are efficient ways to control risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this report, we present improvements in the forecasting of volatility using some novel techniques. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with higher volatility levels, the high-low estimators improve the prediction of volatility. We consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average estimator of volatility. Indeed, we consider a simple stochastic volatility model which permits us to integrate the dynamics of the volatility in the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain dynamically the optimal averaging window. We also compare these results for the range-based estimators with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors. Finally, we studied the high-frequency volatility estimator, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two-time-scale estimator.

Support vector machine is a well-established method with a very wide use in various domains. From the financial point of view, this method can be used to recognize and to predict the high-performance stocks. SVM is a good indicator for building efficient trading strategies over a stock universe. Within the second part of this report, we first revisited the basic idea of SVM in both the classification and regression contexts. The extension to the multi-classification case was also discussed in detail, as were various applications of this technique. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first one consists of using SVM as a signal filter; the advantage of the method is that we can calibrate the model parameters by using only the available data. The second application employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this report are the score construction and the sector recognition. Both types of resulting information are important for building momentum strategies, which play an important role in Lyxor quantitative management.

Finally, we carried out a detailed analysis of the performance of the trend-following strategy in order to understand its important role in risk diversification and in optimizing the absolute return. In the third part, we studied the impact of the estimation error and of market parameters, such as the correlation and the average performance of the individual stocks, on a trend-following strategy, both in the single-asset and multi-asset cases. The objective of this chapter is twofold. First, we established a general framework for analyzing a CTA fund. Second, we illustrated important results on the trading impact of a CTA strategy via a simple "toy model". We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) within a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. This implies that, above this limit, adding more assets does not improve the performance very much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well. As usual, the higher the correlation level, the less efficient the strategies are. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gains is much higher than the conditional expectation of losses.

Appendix A

Appendix of chapter 1

A.1 Computational aspects of L1 , L2 filters


A.1.1 The dual problem
The L1 − T filter

This problem can be solved by considering the dual problem which is a QP program. We first rewrite the primal problem with the new variable z = Dx:

min  (1/2)||y − x||_2² + λ||z||_1
u.c.  z = Dx

We construct now the Lagrangian function with the dual variable ν ∈ R^{n−2}:

L(x, z, ν) = (1/2)||y − x||_2² + λ||z||_1 + ν^T(Dx − z)

The dual objective function is obtained in the following way:

inf_{x,z} L(x, z, ν) = −(1/2)ν^T D D^T ν + y^T D^T ν

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

min  (1/2)ν^T D D^T ν − y^T D^T ν
u.c.  −λ1 ≤ ν ≤ λ1

This QP program can be solved by traditional Newton algorithm or by interior-point methods, and the final solution of the trend reads:

x* = y − D^T ν
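A compact numerical sketch of this dual problem is given below; it uses scipy's box-constrained L-BFGS-B routine in place of a dedicated QP or interior-point solver, and the simulated signal and the value of λ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def l1_trend_filter(y, lam):
    """L1-T filter via its dual: min 0.5 nu' D D' nu - y' D' nu  with  -lam <= nu <= lam."""
    n = len(y)
    # Second-difference operator D of size (n-2) x n
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    DDt, Dy = D @ D.T, D @ y

    obj = lambda nu: 0.5 * nu @ DDt @ nu - Dy @ nu
    grad = lambda nu: DDt @ nu - Dy
    res = minimize(obj, np.zeros(n - 2), jac=grad, method="L-BFGS-B",
                   bounds=[(-lam, lam)] * (n - 2))
    return y - D.T @ res.x          # x* = y - D' nu*

# Example: noisy piecewise-linear signal
rng = np.random.default_rng(9)
t = np.arange(200)
signal = np.where(t < 100, 0.01 * t, 1.0 - 0.005 * (t - 100))
trend = l1_trend_filter(signal + 0.05 * rng.standard_normal(200), lam=50.0)
```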


The L1 − C filter
The optimization procedure for the L1 − C filter follows the same strategy as the L1 − T filter. We obtain the same quadratic program, with the operator D replaced by the (n − 1) × n matrix which is the discrete version of the first-order derivative:

D = ⎡ −1   1                      ⎤
    ⎢      −1   1                 ⎥
    ⎢           ⋱    ⋱            ⎥
    ⎢                −1   1       ⎥
    ⎣                     −1   1  ⎦

The L1 − T C filter
In order to follow the same strategy presented above, we introduce two additional variables z1 = D1 x and z2 = D2 x. The initial problem becomes:

min  (1/2)||y − x||_2² + λ1 ||z1||_1 + λ2 ||z2||_1
u.c.  z1 = D1 x,  z2 = D2 x

The Lagrangian function with the dual variables ν1 ∈ R^{n−1} and ν2 ∈ R^{n−2} is:

L(x, z1, z2, ν1, ν2) = (1/2)||y − x||_2² + λ1||z1||_1 + λ2||z2||_1 + ν1^T(D1 x − z1) + ν2^T(D2 x − z2)

whereas the dual objective function is:

inf_{x,z1,z2} L(x, z1, z2, ν1, ν2) = −(1/2)||D1^T ν1 + D2^T ν2||_2² + y^T(D1^T ν1 + D2^T ν2)

for −λi 1 ≤ νi ≤ λi 1 (i = 1, 2). Introducing the variables z = (z1, z2) and ν = (ν1, ν2), the initial problem is equivalent to the dual problem:

min  (1/2)ν^T Q ν − R^T ν
u.c.  −ν⁺ ≤ ν ≤ ν⁺

with D = (D1; D2), Q = DD^T, R = Dy and ν⁺ = (λ1 1; λ2 1). The solution of the primal problem is then given by x* = y − D^T ν.

The L1 − T multivariate filter

As in the univariate case, this problem can be solved by considering the dual problem which is a QP program. The primal problem is:

min  (1/2) Σ_{i=1}^{m} ||y^(i) − x||_2² + λ||z||_1
u.c.  z = Dx


Let us define ȳ = (ȳ_t) with ȳ_t = m^{−1} Σ_{i=1}^{m} y_t^(i). The dual objective function becomes:

inf_{x,z} L(x, z, ν) = −(1/2)ν^T D D^T ν + ȳ^T D^T ν + (1/2) Σ_{i=1}^{m} (y^(i) − ȳ)^T (y^(i) − ȳ)

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

min  (1/2)ν^T D D^T ν − ȳ^T D^T ν
u.c.  −λ1 ≤ ν ≤ λ1

This QP program can be solved by traditional Newton algorithm or by interior-point methods and the solution is:

x* = ȳ − D^T ν

A.1.2 The interior-point algorithm


We present briefly the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimization problem:

min  f0(x)
u.c.  Ax = b,  fi(x) < 0 for i = 1, . . . , m

where f0, . . . , fm : R^n → R are convex and twice continuously differentiable and rank(A) = p < n. The inequality constraints become implicit if one rewrites the problem as:

min  f0(x) + Σ_{i=1}^{m} I−(fi(x))
u.c.  Ax = b

where I−(u) : R → R is the non-positive indicator function (see footnote 1). This indicator function is discontinuous, hence the Newton method cannot be applied. In order to overcome this problem, we approximate I−(u) by the logarithmic barrier function I−*(u) = −τ^{−1} ln(−u) with τ → ∞. Finally, the Kuhn-Tucker condition for this approximate problem gives rτ(x, λ, ν) = 0 with:

rτ(x, λ, ν) = ( ∇f0(x) + ∇f(x)^T λ + A^T ν ;  −diag(λ) f(x) − τ^{−1} 1 ;  Ax − b )
Footnote 1: We have I−(u) = 0 if u ≤ 0 and I−(u) = ∞ if u > 0.


The solution of rτ(x, λ, ν) = 0 can be obtained by Newton's iteration for the triple y = (x, λ, ν):

rτ(y + Δy) ≈ rτ(y) + ∇rτ(y) Δy = 0

This equation gives the Newton step Δy = −∇rτ(y)^{−1} rτ(y), which defines the search direction.

A.1.3 The scaling of smoothing parameter of L1 filter


We can estimate the order of magnitude of the parameter λmax by considering the continuous case. Assume that the signal is a process Wt. The value of λmax in the discrete case, defined by:

λmax = ||(D D^T)^{−1} D y||_∞

can be considered as the first primitive I1(T) = ∫_0^T Wt dt of the process Wt if D = D1 (L1 − C filtering), or as the second primitive I2(T) = ∫_0^T ∫_0^t Ws ds dt of Wt if D = D2 (L1 − T filtering). We have:

I1(T) = ∫_0^T Wt dt = WT T − ∫_0^T t dWt = ∫_0^T (T − t) dWt

The process I1(T) is a Wiener integral (a Gaussian process) with variance:

E[I1²(T)] = ∫_0^T (T − t)² dt = T³/3

In this case, we expect that λmax ∼ T^{3/2}. The second-order primitive can be calculated in the following way:

I2(T) = ∫_0^T I1(t) dt
      = I1(T) T − ∫_0^T t dI1(t)
      = I1(T) T − ∫_0^T t Wt dt
      = I1(T) T − (T²/2) WT + ∫_0^T (t²/2) dWt
      = −(T²/2) WT + ∫_0^T (T² − T t + t²/2) dWt
      = (1/2) ∫_0^T (T − t)² dWt


This quantity is again a Gaussian process with variance:

E[I2²(T)] = (1/4) ∫_0^T (T − t)⁴ dt = T⁵/20

In this case, we expect that λmax ∼ T^{5/2}.

A.1.4 Calibration of the L2 filter


We discuss here how to calibrate the L2 filter in order to extract the trend with respect to the investment time horizon T. Although the L2 filter admits an explicit solution, which is a great advantage for the numerical implementation, the calibration of the smoothing parameter λ is not trivial. We propose to calibrate the L2 filter by comparing the spectral density of this filter with the one obtained for the moving-average filter. For this last filter, we have:

x̂_t^MA = (1/T) Σ_{i=t−T}^{t−1} y_i

It follows that the spectral density is:

f(ω) = (1/T²) | Σ_{t=0}^{T−1} e^{−iωt} |²

For the L2 filter, we know that the solution is x̂^HP = (1 + 2λ D^T D)^{−1} y. Therefore, the spectral density is:

f^HP(ω) = ( 1 / (1 + 4λ(3 − 4 cos ω + cos 2ω)) )²  ≈  ( 1 / (1 + 2λω⁴) )²

The width of the spectral density of the L2 filter is then (2λ)^{−1/4}, whereas it is 2πT^{−1} for the moving-average filter. Calibrating the L2 filter can thus be done by matching these two quantities. Finally, we obtain the following relationship:

λ ∝ λ* = (1/2) (T / 2π)⁴

In Figure A.1, we represent the spectral density of the moving-average filter for different windows T. We also report the spectral density of the corresponding L2 filters. For that, we have calibrated the optimal parameter λ* by least-squares minimization. In Figure A.2, we compare the optimal estimator λ* with the one corresponding to 10.27 × λ*. We notice that the approximation is very good.
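As a quick numerical illustration of this matching (the window T = 50 and the frequency grid are arbitrary choices made for this sketch), one can compute λ* and the two spectral densities directly:

```python
import numpy as np

# Calibration of the L2 filter: match its spectral width (2*lambda)^(-1/4)
# with the moving-average width 2*pi/T, which gives lambda* = 0.5 * (T / (2*pi))^4
T = 50
lam_star = 0.5 * (T / (2 * np.pi)) ** 4
print(f"T = {T}: lambda* = {lam_star:.1f}")
print("L2 width (2*lambda*)^(-1/4) =", (2 * lam_star) ** (-0.25))
print("MA width 2*pi/T             =", 2 * np.pi / T)

# The two spectral densities on a frequency grid, for a visual comparison
omega = np.linspace(1e-3, np.pi, 1000)
f_ma = np.abs(np.exp(-1j * omega[:, None] * np.arange(T)).sum(axis=1)) ** 2 / T ** 2
f_hp = (1.0 / (1.0 + 4 * lam_star * (3 - 4 * np.cos(omega) + np.cos(2 * omega)))) ** 2
```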


Figure A.1: Spectral density of moving-average and L2 filters

Figure A.2: Relationship between the value of λ and the length of the moving-average
filter


A.1.5 Implementation issues


The computational time may be large when working with dense matrices, even if we consider interior-point algorithms. It can be reduced by using sparse matrices, but the most efficient way to optimize the implementation is to exploit band matrices. Moreover, we may also notice that we have to solve a large linear system at each iteration. Depending on the filtering problem (L1 − T, L1 − C or L1 − TC filters), the system is 6-band or 3-band but always symmetric. For computing λmax, one may remark that it is equivalent to solving a band system which is positive definite. We suggest adapting the algorithms in order to take all these properties into account.

Appendix B

Appendix of chapter 2

B.1 Estimator of volatility


B.1.1 Estimation with realized return
We consider first a single return; the estimator of volatility is then obtained as follows:

R²_{ti} = (ln S_{ti} − ln S_{ti−1})² = ( ∫_{ti−1}^{ti} σu dWu + ∫_{ti−1}^{ti} (µu − σu²/2) du )²

The conditional expectation with respect to the couple (σu, µu), which is assumed to be independent of dWu, is given by:

E[R²_{ti} | σ, µ] = ∫_{ti−1}^{ti} σu² du + ( ∫_{ti−1}^{ti} (µu − σu²/2) du )²

which is approximately equal to:

(ti − ti−1) σ²_{ti−1} + (ti − ti−1)² (µ_{ti−1} − σ²_{ti−1}/2)²

The variance of this estimator characterizes the error and reads:

var(R²_{ti} | σ, µ) = var[ ( ∫_{ti−1}^{ti} σu dWu + ∫_{ti−1}^{ti} (µu − σu²/2) du )² | σ, µ ]

Conditionally on σ and µ, the quantity ∫_{ti−1}^{ti} σu dWu + ∫_{ti−1}^{ti} (µu − σu²/2) du is a Gaussian variable with mean ∫_{ti−1}^{ti} (µu − σu²/2) du and variance ∫_{ti−1}^{ti} σu² du. Therefore, we obtain the variance of the estimator:

var(R²_{ti} | σ, µ) = 2 ( ∫_{ti−1}^{ti} σu² du )² + 4 ( ∫_{ti−1}^{ti} σu² du ) ( ∫_{ti−1}^{ti} (µu − σu²/2) du )²    (B.1)


which is approximately equal to:

2 (ti − ti−1)² σ⁴_{ti−1} + 4 (ti − ti−1)³ σ²_{ti−1} (µ_{ti−1} − σ²_{ti−1}/2)²

We remark that when the time step (ti − ti−1) becomes small, the estimator becomes unbiased, with standard deviation √2 (ti − ti−1) σ²_{ti−1}. This error is directly proportional to the quantity to be estimated.

In order to estimate the average variance between t0 and tn, or the approximate volatility at tn, we can employ the canonical estimator:

Σ_{i=1}^{n} R²_{ti} = Σ_{i=1}^{n} (ln S_{ti} − ln S_{ti−1})²

The expectation of this estimator reads:

E[ Σ_{i=1}^{n} R²_{ti} | σ, µ ] = ∫_{t0}^{tn} σu² du + Σ_{i=1}^{n} ( ∫_{ti−1}^{ti} (µu − σu²/2) du )²

We observe that this estimator is weakly biased; however, this effect is totally negligible. If we consider a volatility of 20% with a trend of 10%, the estimated volatility is 20.006% instead of 20%.
The variance of the canonical estimator (the estimation error) reads:

Σ_{i=1}^{n} [ 2 ( ∫_{ti−1}^{ti} σu² du )² + 4 ( ∫_{ti−1}^{ti} σu² du ) ( ∫_{ti−1}^{ti} (µu − σu²/2) du )² ]

which can be roughly estimated by:

Σ_{i=1}^{n} 2 ( ∫_{ti−1}^{ti} σu² du )² ≈ 2σ⁴ Σ_{i=1}^{n} (ti − ti−1)²

If the recorded times ti are regularly spaced with time step Δt, then we have:

var( Σ_{i=1}^{n} R²_{ti} | σ, µ ) ≈ 2σ⁴ (tn − t0) Δt
Appendix C

Appendix of chapter 3

C.1 Dual problem of SVM


In the traditional approach, the SVM problem is first mapped to its dual problem, which is then solved by a QP program. We present here the detailed derivation of the dual problem in both the hard-margin and the soft-margin SVM cases.

C.1.1 Hard-margin SVM classifier


Let us start with the hard-margin SVM problem for classification:

min_{w,b}  (1/2)||w||²
u.c.  yi (w^T xi + b) ≥ 1,  i = 1 . . . n

In order to obtain the dual problem, we construct the Lagrangian for the inequality constraints by introducing positive Lagrange multipliers Λ = (α1, . . . , αn) ≥ 0:

L(w, b, Λ) = (1/2)||w||² − Σ_{i=1}^{n} αi yi (w^T xi + b) + Σ_{i=1}^{n} αi

Minimizing the Lagrangian with respect to (w, b), we obtain the following equations:

∂L/∂w^T = w − Σ_{i=1}^{n} αi yi xi = 0
∂L/∂b = −Σ_{i=1}^{n} αi yi = 0

Inserting these results into the Lagrangian, we obtain the dual objective function LD with respect to the variable Λ:

LD(Λ) = Λ^T 1 − (1/2) Λ^T D Λ


with Dij = yi yj xi^T xj and the constraints Λ^T y = 0 and Λ ≥ 0. Thanks to the KKT theorem, the initial optimization problem is equivalent to maximizing the dual objective function LD(Λ):

max_Λ  Λ^T 1 − (1/2) Λ^T D Λ
u.c.  Λ^T y = 0,  Λ ≥ 0

C.1.2 Soft-margin SVM classifier

We turn now to the soft-margin SVM classifier with the L1 penalty, i.e. F(u) = u and p = 1. We first write down the primal problem:

min_{w,b,ξ}  (1/2)||w||² + C F( Σ_{i=1}^{n} ξi^p )
u.c.  yi (w^T xi + b) ≥ 1 − ξi,  ξi ≥ 0,  i = 1 . . . n

We construct the Lagrangian by introducing the couple of Lagrange multipliers (Λ, µ) for the 2n constraints:

L(w, b, ξ, Λ, µ) = (1/2)||w||² + C F( Σ_{i=1}^{n} ξi ) − Σ_{i=1}^{n} αi [ yi (w^T xi + b) − 1 + ξi ] − Σ_{i=1}^{n} µi ξi

with the constraints Λ ≥ 0 and µ ≥ 0 on the Lagrange multipliers. Minimizing the Lagrangian with respect to (w, b, ξ) gives:

∂L/∂w^T = w − Σ_{i=1}^{n} αi yi xi = 0
∂L/∂b = −Σ_{i=1}^{n} αi yi = 0
∂L/∂ξ = C − Λ − µ = 0

with the inequality constraints Λ ≥ 0 and µ ≥ 0. Inserting these results into the Lagrangian leads to the dual problem:

max_Λ  Λ^T 1 − (1/2) Λ^T D Λ    (C.1)
u.c.  Λ^T y = 0,  0 ≤ Λ ≤ C1
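This dual program can be checked numerically on a small data set. The sketch below solves (C.1) with scipy's SLSQP routine (an illustrative choice; a dedicated QP solver would normally be used) and recovers the primal solution.

```python
import numpy as np
from scipy.optimize import minimize

# Small two-class data set
rng = np.random.default_rng(11)
X = np.vstack([rng.normal(-1.5, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
C, n = 1.0, len(y)

# Dual (C.1): max 1'A - 0.5 A' D A  with  A'y = 0 and 0 <= A <= C, where D_ij = y_i y_j x_i' x_j
D = (y[:, None] * X) @ (y[:, None] * X).T

neg_dual = lambda a: 0.5 * a @ D @ a - a.sum()
res = minimize(neg_dual, np.full(n, C / 2), jac=lambda a: D @ a - 1.0,
               method="SLSQP", bounds=[(0.0, C)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# Recover the primal solution w = sum_i alpha_i y_i x_i and b from a margin support vector
w = (alpha * y) @ X
sv = np.argmax((alpha > 1e-5) & (alpha < C - 1e-5))
b = y[sv] - w @ X[sv]
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```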


C.1.3 ε-SV regression

We study here the ε-SV regression. We first write down the primal problem with all its constraints:

min_{w,b,ξ,ξ'}  (1/2)||w||² + C Σ_{i=1}^{n} (ξi + ξi')
u.c.  w^T xi + b − yi ≤ ε + ξi
      yi − w^T xi − b ≤ ε + ξi'
      ξi ≥ 0,  ξi' ≥ 0,  i = 1 . . . n

In this case, we have 4n inequality constraints. Hence, we construct the Lagrangian by introducing the positive Lagrange multipliers (Λ, Λ', µ, µ'). The Lagrangian of this primal problem reads:

L(w, b, ξ, ξ', Λ, Λ', µ, µ') = (1/2)||w||² + C Σ_{i=1}^{n} (ξi + ξi') − Σ_{i=1}^{n} µi ξi − Σ_{i=1}^{n} µi' ξi'
    − Σ_{i=1}^{n} αi [ w^T φ(xi) + b − yi + ε + ξi ] − Σ_{i=1}^{n} βi [ −w^T φ(xi) − b + yi + ε + ξi' ]

with Λ = (αi)_{i=1...n}, Λ' = (βi)_{i=1...n} and the constraints Λ, Λ', µ, µ' ≥ 0 on the Lagrange multipliers. Minimizing the Lagrangian with respect to (w, b, ξ, ξ') gives:

∂L/∂w^T = w − Σ_{i=1}^{n} (αi − βi) φ(xi) = 0
∂L/∂b = Σ_{i=1}^{n} (βi − αi) = 0
∂L/∂ξ = C1 − Λ − µ = 0
∂L/∂ξ' = C1 − Λ' − µ' = 0

Inserting these results into the Lagrangian leads to the dual problem:

max_{Λ,Λ'}  (Λ − Λ')^T y − ε (Λ + Λ')^T 1 − (1/2)(Λ − Λ')^T K (Λ − Λ')    (C.2)
u.c.  (Λ − Λ')^T 1 = 0,  0 ≤ Λ, Λ' ≤ C1

When ε = 0, the term ε(Λ + Λ')^T 1 in the objective function disappears, and we can reduce the optimization problem by the change of variable (Λ − Λ') → Λ. The inequality constraint for the new variable then reads |Λ| ≤ C1.


The dual problem can be solved by a QP program, which gives the optimal solution Λ*. In order to compute b, we use the KKT conditions:

αi ( w^T φ(xi) + b − yi + ε + ξi ) = 0
βi ( yi − w^T φ(xi) − b + ε + ξi' ) = 0
(C − αi) ξi = 0
(C − βi) ξi' = 0

We remark that the two last conditions give ξi = 0 for 0 < αi < C and ξi' = 0 for 0 < βi < C. In the case ε = 0, this result implies directly the following condition for every support vector (xi, yi) of the training set:

w^T φ(xi) + b − yi = 0

We denote by SV the set of support vectors. Using the condition w = Σ_{i=1}^{n} (αi − βi) φ(xi) and averaging over the support vectors, we finally obtain:

b = (1 / nSV) Σ_{i∈SV} ( yi − (z)i )

with z = K (Λ − Λ').

C.2 Newton optimization for the primal problem


We consider here the Newton optimization scheme for solving the unconstrained primal problem:

min_{β,b}  LP(β, b) = min_{β,b}  (1/2) β^T K β + C Σ_{i=1}^{n} L(yi, Ki^T β + b)

The required condition for this scheme is that the loss function L(y, t) be differentiable. We first study the case of the quadratic loss, where L(y, t) is differentiable, and then the soft-margin case, where we have to regularize L(y, t).

C.2.1 Quadratic loss function


For the quadratic loss case, the penalty function has a suitable form:

L (yi , f (xi )) = max (0, 1 − yi f (xi ))2

This function is differentiable everywhere and its derivative reads:


∂L
(y, t) = 2y (yt − 1) I{yt≤1}
∂t
However, the second derivative is not defined at the point yt = 1. In order to avoid
this problem, we consider directly the function L as a function of the vector β and

128
Analysis of Trading Impact in the CTA strategy

perform a quasi-Newton optimization. The second derivative is then replaced by an
approximation of the Hessian matrix. The gradient of the objective function with
respect to the vector (b, β)ᵀ is given as follows:
$$\nabla L_P = \begin{pmatrix} 2C \, 1^T I^0 1 & 2C \, 1^T I^0 K \\ 2C \, K I^0 1 & K + 2C \, K I^0 K \end{pmatrix} \begin{pmatrix} b \\ \beta \end{pmatrix} - 2C \begin{pmatrix} 1^T I^0 y \\ K I^0 y \end{pmatrix}$$
where I⁰ is the diagonal matrix selecting the observations with yi(Kiᵀβ + b) < 1. The
pseudo-Hessian matrix is given by:
$$H = \begin{pmatrix} 2C \, 1^T I^0 1 & 2C \, 1^T I^0 K \\ 2C \, K I^0 1 & K + 2C \, K I^0 K \end{pmatrix}$$
The Newton iteration then consists of updating the vector (b, β)ᵀ until convergence
as follows:
$$\begin{pmatrix} b \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} b \\ \beta \end{pmatrix} - \gamma H^{-1} \nabla L_P$$
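To make the scheme concrete, the following Python sketch implements the iteration above for the quadratic loss. It is an illustration under our own choices (ridge term on H, stopping rule, step size γ), not the code used in the thesis.

```python
import numpy as np

def primal_newton_svm(K, y, C=1.0, gamma=1.0, n_iter=50, tol=1e-8):
    """Newton optimization of L_P(beta, b) for the quadratic (squared hinge) loss."""
    n = len(y)
    beta, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        f = K @ beta + b                          # decision values K_i' beta + b
        active = (y * f < 1).astype(float)        # margin violators
        I0 = np.diag(active)
        # gradient of L_P with respect to (b, beta)
        grad_b = 2 * C * active @ (f - y)
        grad_beta = K @ beta + 2 * C * K @ I0 @ (f - y)
        grad = np.concatenate(([grad_b], grad_beta))
        # pseudo-Hessian
        H = np.zeros((n + 1, n + 1))
        H[0, 0] = 2 * C * active.sum()
        H[0, 1:] = 2 * C * active @ K
        H[1:, 0] = H[0, 1:]
        H[1:, 1:] = K + 2 * C * K @ I0 @ K
        step = np.linalg.solve(H + 1e-10 * np.eye(n + 1), grad)
        b, beta = b - gamma * step[0], beta - gamma * step[1:]
        if np.linalg.norm(step) < tol:
            break
    return beta, b
```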

C.2.2 Soft-margin SVM


For the soft-margin case, the penalty function has the following form:
$$L(y_i, f(x_i)) = \max\left(0, 1 - y_i f(x_i)\right)$$
which is not differentiable at yi f(xi) = 1 and therefore requires a regularization. A
differentiable approximation is obtained with the following penalty function:
$$L(y, t) = \begin{cases} 0 & \text{if } y t > 1 + h \\ \dfrac{(1 + h - y t)^2}{4h} & \text{if } |1 - y t| \leq h \\ 1 - y t & \text{if } y t < 1 - h \end{cases}$$
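A small helper makes the smoothed loss concrete; the function below (an illustrative sketch with our own naming, not the thesis code) returns the regularized loss and its derivative in t for a smoothing parameter h.

```python
import numpy as np

def smoothed_hinge(y, t, h=0.5):
    """Huber-type smoothing of the soft-margin loss max(0, 1 - y t)."""
    z = y * t
    loss = np.where(z > 1 + h, 0.0,
           np.where(z < 1 - h, 1.0 - z, (1 + h - z) ** 2 / (4 * h)))
    dloss_dt = np.where(z > 1 + h, 0.0,
               np.where(z < 1 - h, -y, -y * (1 + h - z) / (2 * h)))
    return loss, dloss_dt
```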

Published paper in the Lyxor White Paper Series:

Trend Filtering Methods For


Momentum Strategies
Lyxor White Paper Series, Issue # 8, December 2011
http://www.lyxor.com/fr/publications/white-papers/wp/52/

Foreword

The widespread endeavor to “identify” trends in market prices has given rise to a signif-
icant amount of literature. Elliott Wave Principles, Dow Theory, Business cycles, among
many others, are common examples of attempts to better understand the nature of market
prices trends.

Unfortunately this literature often proves frustrating. In their attempt to discover new
rules, many authors eventually lack precision and forget to apply basic research methodology.
Results are indeed often presented without any reference either to the necessary hypotheses or
to confidence intervals. As a result, it is difficult for investors to find firm guidance there
and to differentiate phonies from the real McCoy.

This said, attempts to differentiate meaningful information from exogenous noise lie at
the core of modern Statistics and Time Series Analysis. Time Series Analysis follows similar
goals as the above mentioned approaches but in a manner which can be tested. Today more
than ever, modern computing capacities can allow anybody to implement quite powerful
tools and to independently tackle trend estimation issues. The primary aim of this 8th
White Paper is to act as a comprehensive and simple handbook to the most
widespread trend measurement techniques.

Even equipped with refined measurement tools, investors have still to remain wary about
their representation of trends. Trends are sometimes thought about as some hidden force
pushing markets up or down. In this deterministic view, trends should persist.
However, random walks also generate trends! Five reds drawn in a row from a non
biased roulette wheel do not give any clue about the next drawn color. It is just a past trend
with nothing to do with any underlying structure but a mere succession of independent
events. And the bottom line is that none of those two hypotheses can be confirmed or
dismissed with certainty.

As a consequence, overfitting issues constitute one of the most serious pitfalls in applying
trend filtering techniques in finance. Designing effective calibration procedures reveals to be
as important as the theoretical knowledge of trend measurement theories. The practical
use of trend extraction techniques for investment purposes constitutes the other
topic addressed in this 8th White Paper.

Nicolas Gaussel
Global Head of Quantitative Asset Management


Executive Summary

Introduction
The efficient market hypothesis implies that all available information is reflected in current
prices, and thus that future returns are unpredictable. Nevertheless, this assumption has
been rejected in a large number of academic studies. It is commonly accepted that financial
assets may exhibit trends or cycles. Some studies cite slow-moving economic variables related
to the business cycle as an explanation for these trends. Other research argues that investors
are not fully rational, meaning that prices may underreact in the short run and overreact at
long horizons.

Momentum strategies try to benefit from these trends. There are two opposing types:
trend following and contrarian. Trend following strategies are momentum strategies in which
an asset is purchased if the price is rising, while in the contrarian strategy assets are sold
if the price is falling. The first step in both strategies is trend estimation, which is the
focus of this paper. After a review of trend filtering techniques, we address practical issues,
depending on whether trend detection is designed to explain the past or forecast the future.

The principles of trend filtering


In time series analysis, the trend is considered to be the component containing the global
change, which contrasts with local changes due to noise. The separation between trend and
noise has a long mathematical history, and continues to be of great interest to the scientific
community. There is no precise definition of the trend, but it is generally accepted that it
is a smooth function representing long-term movement. Thus, trends should exhibit slow
change, while noise is assumed to be highly volatile.

The simplest trend filtering method is the moving average filter. On average, the noisy
parts of observations tend to cancel each other out, while the trend has a cumulative nature.
But observations can be averaged using many different types of weightings. More generally,
the different averages obtained are referred to as linear filtering. Several examples repre-
senting trend filtering for various linear filters are shown in Figure 1. In this example, the
averaging horizon (65 business days or one year) has much more influence than the type of
averaging.

Other trend following methods, which are classified as nonlinear, use more complex
calculations to obtain more specific results (such as filters based on wavelet analysis, support
vector machines or singular spectrum analysis). For instance, the L1 filter is designed to
obtain piecewise constant trends, which can be interpreted more easily.

Figure 1: Trend estimate of the S&P 500 index

Variations around a benchmark estimator


Trend filtering can be performed either to explain past behaviour of asset prices, or to
forecast future returns. The choice of the estimator and its calibration primarily depend
on that objective. If the goal is to explain past price behaviour, there are two possible
approaches. The first is to select the model and parameters that minimise past prediction
error. This can be performed using a cross-validation procedure, for example. The second
option is to consider a benchmark estimator, such as the six-month moving average, and to
calibrate another model to be as close to the benchmark as possible. For instance, the L1
filter of Figure 2 is calibrated to deliver a constant trend over an average six-month period.
This type of filter is more easily interpreted than the original six-month moving average,
with clearly delimited trend periods. This procedure can be performed on any time series.

From trend filtering to forecasting


Trend filtering may also be a predictive tool. This is a much more ambitious objective.
It supposes that the last observed trend has an influence on future asset returns. More
precisely, trend following predictions suppose that positive (or negative) trends are more
likely to be followed by positive (or negative) returns. Any trend following method would
be useless if this assumption did not hold.

Figure 3 illustrates that the distributions of the one-month GSCI index returns after
a very positive three-month trend (i.e. above a threshold) clearly dominate the return
distribution after a very negative trend (i.e. below the threshold).


Figure 2: L1 versus moving average filtering

Figure 3: Distribution of the conditional standardised monthly return

Furthermore, this persistence effect is also tested in Table 1 for a number of major
financial indices. This table compares the average one-month return following a positive
three-month trend period to the average one-month return following a negative three-month
trend period.

Table 1: Average one-month conditional return based on past trends

Trend Positive Negative Difference


Eurostoxx 50 1.1% 0.2% 0.9%
S&P 500 0.9% 0.5% 0.4%
MSCI WORLD 0.6% −0.3% 1.0%
MSCI EM 1.9% −0.3% 2.2%
TOPIX 0.4% −0.4% 0.9%
EUR/USD 0.2% −0.2% 0.4%
USD/JPY 0.2% −0.2% 0.4%
GSCI 1.3% −0.4% 1.6%

On average, for all indices under consideration, returns are higher after a positive trend than
after a negative one. Thus, the trends are persistent, and seem to have a predictive value.
This makes the case for the study of trend following strategies, and highlights the appeal of
trend filtering methods.

Conclusion
The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit
from the identified trends. Such strategies must rely on appropriate trend estimators and
time horizons. This paper highlights the variety of estimators available in the academic
literature. But the choice of trend estimator is just one of the many questions that arises
in the definition of those strategies. In particular, diversification and risk budgeting are key
aspects of success.


Table of Contents

1 Introduction 9

2 A review of econometric estimators for trend filtering 10


2.1 The trend-cycle model . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Linear filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Nonlinear filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Multivariate filtering . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Trend filtering in practice 30


3.1 The calibration problem . . . . . . . . . . . . . . . . . . . . . . 30
3.2 What about the variance of the estimator? . . . . . . . . . . . . 33
3.3 From trend filtering to trend forecasting . . . . . . . . . . . . . 38

4 Conclusion 40

A Statistical complements 41
A.1 State space model and Kalman filtering . . . . . . . . . . . . . . 41
A.2 L1 filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.3 Wavelet analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.4 Support vector machine . . . . . . . . . . . . . . . . . . . . . . 47
A.5 Singular spectrum analysis . . . . . . . . . . . . . . . . . . . . . 50


Trend Filtering Methods
for Momentum Strategies∗

Benjamin Bruder, Research & Development, Lyxor Asset Management, Paris (benjamin.bruder@lyxor.com)
Tung-Lam Dao, Research & Development, Lyxor Asset Management, Paris (tung-lam.dao@lyxor.com)
Jean-Charles Richard, Research & Development, Lyxor Asset Management, Paris (jean-charles.richard@lyxor.com)
Thierry Roncalli, Research & Development, Lyxor Asset Management, Paris (thierry.roncalli@lyxor.com)

December 2011

Abstract
This paper studies trend filtering methods. These methods are widely used in mo-
mentum strategies, which correspond to an investment style based only on the history
of past prices. For example, the CTA strategy used by hedge funds is one of the
best-known momentum strategies. In this paper, we review the different econometric
estimators to extract a trend of a time series. We distinguish between linear and non-
linear models as well as univariate and multivariate filtering. For each approach, we
provide a comprehensive presentation, an overview of its advantages and disadvantages
and an application to the S&P 500 index. We also consider the calibration problem of
these filters. We illustrate the two main solutions, the first based on prediction error,
and the second using a benchmark estimator. We conclude the paper by listing some
issues to consider when implementing a momentum strategy.

Keywords: Momentum strategy, trend following, moving average, filtering, trend extrac-
tion.
JEL classification: G11, G17, C63.

1 Introduction
The efficient market hypothesis tells us that financial asset prices fully reflect all available
information (Fama, 1970). One consequence of this theory is that future returns are not
predictable. Nevertheless, since the beginning of the nineties, a large body of academic
research has rejected this assumption. One of the arguments is that risk premiums are time
varying and depend on the business cycle (Cochrane, 2001). In this framework, returns
on financial assets are related to some slow-moving economic variables that exhibit cyclical
patterns in accordance with the business cycle. Another argument is that some agents are
∗ We are grateful to Guillaume Jamet and Hoang-Phong Nguyen for their helpful comments.

not fully rational, meaning that prices may underreact in the short run but overreact at long
horizons (Hong and Stein, 1997). This phenomenon may be easily explained by the theory
of behavioural finance (Barberis and Thaler, 2002).

Based on these two arguments, it is now commonly accepted that prices may exhibit
trends or cycles. In some sense, these arguments chime with the Dow theory (Brown et al.,
1998), which is one of the first momentum strategies. A momentum strategy is an investment
style based only on the history of past prices (Chan et al., 1996). We generally distinguish
between two types of momentum strategy:

1. the trend following strategy, which consists of buying (or selling) an asset if the esti-
mated price trend is positive (or negative);

2. the contrarian (or mean-reverting) strategy, which consists of selling (or buying) an
asset if the estimated price trend is positive (or negative).

Contrarian strategies are clearly the opposite of trend following strategies. One of the tasks
involved in these strategies is to estimate the trend, except when they are based on mean-reverting
processes (see D’Aspremont, 2011). In this paper, we provide a survey of the different
trend filtering methods. However, trend filtering is just one of the difficulties in building a
momentum strategy. The complete process of constructing a momentum strategy is highly
complex, especially as regards transforming past trends into exposures – an important factor
that is beyond the scope of this paper.

The paper is organized as follows. Section two presents a survey of the different econo-
metric trend estimators. In particular, we distinguish between methods based on linear
filtering and nonlinear filtering. In section three, we consider some issues that arise when
trend filtering is applied in practice. We also propose some methods for calibrating trend
filtering models and highlight the problem of estimator variance. Section four offers some
concluding remarks.

2 A review of econometric estimators for trend filtering


Trend filtering (or trend detection) is a major task of time series analysis from both a
mathematical and financial viewpoint. The trend of a time series is considered to be the
component containing the global change, which contrasts with local changes due to noise.
The trend filtering procedure concerns not only the problem of denoising; it must also
take into account the dynamics of the underlying process. This explains why mathematical
approaches to trend extraction have a long history, and why this subject is still of great
interest to the scientific community1 . From an investment perspective, trend filtering is
fundamental to most momentum strategies developed in asset management and hedge funds
sectors in order to improve performance and limit portfolio risks.

2.1 The trend-cycle model


In economics, trend-cycle decomposition plays an important role by identifying the perma-
nent and transitory stochastic components in a non-stationary time series. Generally, the
permanent component can be interpreted as a trend, whereas the transitory component may
1 See Alexandrov et al. (2008).


be a noise or a stochastic cycle. Let yt be a stochastic process. We assume that yt is the


sum of two different unobservable parts:

yt = xt + εt

where xt represents the trend and εt is a stochastic (or noise) process. There is no precise
definition for trend, but it is generally accepted to be a smooth function representing long-
term movements:

“[...] the essential idea of trend is that it shall be smooth.” (Kendall, 1973).

It means that changes in the trend xt must be smaller than those of the process yt . From a
statistical standpoint, it implies that the volatility of yt − yt−1 is higher than the volatility
of xt − xt−1 :
$$\sigma\left(y_t - y_{t-1}\right) \gg \sigma\left(x_t - x_{t-1}\right)$$
One of the major problems in financial econometrics is the estimation of xt . This is the
subject of signal extraction and filtering (Pollock, 2009).

Finite moving average filtering for trend estimation has a long history. It has been used
in actuarial science since the beginning of the twentieth century2 . But the modern theory of
signal filtering has its origins in the Second World War and was formulated independently
by Norbert Wiener (1941) and Andrei Kolmogorov (1941) in two different ways. Wiener
worked principally in the frequency domain whereas Kolmogorov considered a time-domain
approach. This theory was extensively developed in the fifties and sixties by mathematicians
and statisticians such as Hermann Wold, Peter Whittle, Rudolf Kalman, Maurice Priestley,
George Box, etc. In economics, the problem of trend filtering is not a recent one, and may
date back to the seminal article of Muth (1960). It was extensively studied in the eighties and
nineties in the literature on business cycles, which led to a vast body of empirical research
being carried out in this area3 . However, it is in climatology that trend filtering is most
extensively studied nowadays. Another important point is that the development of filtering
techniques has evolved according to the development of computational power and the IT
industry. The Savitzky-Golay smoothing procedure may appear very basic today though it
was revolutionary4 when it was published in 1964.

In what follows, we review the class of filtering techniques that is generally used to
estimate a trend. Moving average filters play an important role in finance. As they are very
intuitive and easy to implement, they undoubtedly represent the model most commonly used
in trading strategies. The moving average technique belongs to the class of linear filters,
which share a lot of common properties. After studying this class of filters, we consider
some nonlinear filtering techniques, which may be well suited to solving financial problems.

2.2 Linear filtering


2.2.1 The convolution representation
We denote by y = {. . . , y−2 , y−1 , y0 , y1 , y2 , . . .} the ordered sequence of observations of the
process yt . Let x̂t be the estimator of the underlying trend xt which is by definition an
2 See, in particular, the works of Henderson (1916), Whittaker (1923) and Macaulay (1931).
3 See for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and
Prescott (1997).
4 The paper of Savitzky and Golay (1964) is still considered by the Analytical Chemistry journal to be

one of its 10 seminal papers.

unobservable process. A filtering procedure consists of applying a filter L to the data y:

x̂ = L (y)

with x̂ = {. . . , x̂−2 , x̂−1 , x̂0 , x̂1 , x̂2 , . . .}. When the filter is linear, we have x̂ = Ly with the
normalisation condition 1 = L1. If we assume that the signal yt is observed at regular
dates5 , we obtain:


$$\hat{x}_t = \sum_{i=-\infty}^{+\infty} L_{t,t-i} \, y_{t-i} \qquad (1)$$

We deduce that linear filtering may be viewed as a convolution. The previous filter may not
be of much use, however, because it uses future values of yt . As a result, we generally impose
some restriction on the coefficients Lt,t−i in order to use only past and present values of the
signal. In this case, we say that the filter is causal. Moreover, if we restrict our study to
time invariant filters, the equation (1) becomes a simple convolution of the observed signal
yt with a window function Li :
$$\hat{x}_t = \sum_{i=0}^{n-1} L_i \, y_{t-i} \qquad (2)$$

With this notation, a linear filter is characterised by a window kernel Li and its support.
The kernel defines the type of filtering, whereas the support defines the range of the filter.
For instance, if we take a square window on a compact support [0, T ] with T = nΔ the
width of the averaging window, we obtain the well-known moving average filter:

$$L_i = \frac{1}{n} 1\{i < n\}$$
We finish this description by considering the lag representation:
$$\hat{x}_t = \sum_{i=0}^{n-1} L_i \, \mathrm{L}^i \, y_t$$

with the lag operator L satisfying Lyt = yt−1 .
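As a minimal illustration (assumed Python code, not part of the paper), equation (2) can be applied directly with an explicit causal window kernel; the uniform kernel below reproduces the moving average filter.

```python
import numpy as np

def linear_filter(y, L):
    """Apply a causal window kernel L to the signal y: x_hat_t = sum_i L_i y_{t-i}."""
    y = np.asarray(y, dtype=float)
    n = len(L)
    x_hat = np.full(len(y), np.nan)
    for t in range(n - 1, len(y)):
        x_hat[t] = np.dot(L, y[t::-1][:n])    # y_t, y_{t-1}, ..., y_{t-n+1}
    return x_hat

L_uniform = np.ones(65) / 65                  # moving average kernel L_i = 1/n with n = 65
```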

2.2.2 Measuring the trend and its derivative


We discuss here how to use linear filtering to measure the trend of an asset price and its
derivative. Let St be the asset price which follows the dynamics of the Black-Scholes model:

dSt
= μt dt + σt dWt
St

where μt is the drift, σt is the volatility and Wt is a standard Brownian motion. The
asset price St is observed in a series of discrete dates {t0 , . . . , tn }. Within this model, the
appropriate signal to be filtered is the logarithm of the price yt = ln St but not the price
itself. Let Rt = ln St − ln St−1 represent the realised return at time t over a unit period. If
μt and σt are known, we have:
$$R_t = \left(\mu_t - \frac{1}{2}\sigma_t^2\right) \Delta + \sigma_t \sqrt{\Delta} \, \eta_t$$
5 We have ti+1 − ti = Δ.


where ηt is a standard Gaussian white noise. The filtered trend can be extracted using the
following equation:
$$\hat{x}_t = \sum_{i=0}^{n-1} L_i \, y_{t-i}$$
and the estimator of μt is6:
$$\hat{\mu}_t \simeq \frac{1}{\Delta} \sum_{i=0}^{n-1} L_i \, R_{t-i}$$
We can also obtain the same result by applying the filter directly to the signal and defining
the derivative kernel of the window function as $\ell_i = \dot{L}_i$:
$$\hat{\mu}_t \simeq \frac{1}{\Delta} \sum_{i=0}^{n} \ell_i \, y_{t-i}$$
We obtain the following correspondence:
$$\ell_i = \begin{cases} L_0 & \text{if } i = 0 \\ L_i - L_{i-1} & \text{if } i = 1, \ldots, n-1 \\ -L_{n-1} & \text{if } i = n \end{cases} \qquad (3)$$
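The correspondence (3) translates directly into code; the helper below (our own sketch, under the same indexing conventions) maps a window kernel L of length n to the derivative kernel ℓ of length n + 1.

```python
import numpy as np

def derivative_kernel(L):
    """Build the kernel ell used in mu_hat_t = (1/Delta) sum_i ell_i y_{t-i}."""
    n = len(L)
    ell = np.zeros(n + 1)
    ell[0] = L[0]               # i = 0
    ell[1:n] = L[1:] - L[:-1]   # i = 1, ..., n-1
    ell[n] = -L[-1]             # i = n
    return ell
```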

Remark 1 In some sense, μ̂t and x̂t are related by the following expression:
$$\hat{\mu}_t = \frac{\mathrm{d}}{\mathrm{d}t} \hat{x}_t$$
Econometric methods principally involve x̂t , whereas μ̂t is more important for trading strate-
gies.

Remark 2 μ̂t is a biased estimator of μt and the bias increases with the volatility of the
process σt . The expression of the unbiased estimator is then:
$$\hat{\mu}_t = \frac{1}{2}\sigma_t^2 + \frac{1}{\Delta} \sum_{i=0}^{n-1} L_i \, R_{t-i}$$

Remark 3 In the previous analysis, x̂t and μ̂t are two estimators. We may also represent
them by their corresponding probability density functions. It is therefore easy to derive
estimates, but we should not forget that these estimators present some variance. In finance,
and in particular in trading strategies, the question of statistical inference is generally not
addressed. However, it is a crucial factor in designing a successful momentum strategy.

2.2.3 Moving average filters


Average return over a given period Here, we consider the simplest case corresponding
to the moving average filter where the form of the window is:
$$L_i = \frac{1}{n} 1\{i < n\}$$
In this case, the only calibration parameter is the window support, i.e. T = nΔ. It char-
acterises the smoothness of the filtered signal. For the limit T → 0, the window becomes
a Dirac distribution δt and the filtered signal is exactly the same as the observed signal:
6 If we neglect the contribution from the term σt2 . Moreover, we consider Δ = 1 to simplify the calculation.

x̂t = yt . For T > 0, if we assume that the noise εt is independent from xt and is a centered
process, the first contribution of the filtered signal is the average trend:
$$\hat{x}_t = \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i}$$

If the trend is homogeneous, this average value is located at t − (n − 1) /2 by construction.


It means that the filtered signal lags the observed signal by a time period which is half the
window. To extract the derivative of the trend, we compute the derivative kernel ℓi, which
is given by the following formula:
$$\ell_i = \frac{1}{n} \left(\delta_{i,0} - \delta_{i,n}\right)$$
where δi,j is the Kronecker delta7 . The main advantage of using a moving average filter is
the reduction of noise due to the central limit theorem. For the limit case n → ∞, the signal
is completely denoised but it corresponds to the average value of the trend. The estimator is
also biased. In trend filtering, we also face a trade-off between denoising maximisation and
bias minimisation. The problem is the calibration procedure for the lag window T . Another
way to determine the optimal parameter T⋆ is to take into account the dynamics of the
trend.

The above moving average filter can be applied directly to the signal. However, μ̂t is
simply the cumulative return over the window period. It needs only the first and last dates
of the period under consideration.

Moving average crossovers Many practitioners, and even individual investors, use the
moving average of the price itself as a trend indication, instead of the moving average of
returns. These moving averages are generally uniform moving averages of the price. Here
we will consider an average of the logarithm of the price, in order to be consistent with the
previous examples:
$$\hat{y}_t^n = \frac{1}{n} \sum_{i=0}^{n-1} y_{t-i}$$
Of course, an average price does not estimate the trend μt . This trend is estimated from
the difference between two moving averages over two different time horizons n1 and n2 .
Supposing that n1 > n2 , the trend μ may be estimated from:
$$\hat{\mu}_t \simeq \frac{2}{(n_1 - n_2)\,\Delta} \left(\hat{y}_t^{n_2} - \hat{y}_t^{n_1}\right) \qquad (4)$$
In particular, the estimated trend is positive if the short-term moving average is higher
than the long-term moving average. Thus, the sign of the trend changes when the short-
term moving average crosses the long-term moving average. Of course, when the short-term
horizon n2 is one, then the short-term moving average is just the current asset price. The
scaling term 2(n1 − n2)⁻¹ is explained below. It is derived from the interpretation of this
estimator as a weighted moving average of asset returns. Indeed, this estimator can be
interpreted in terms of asset returns by inverting the formula (3), with Li being interpreted
as the primitive of ℓi:
$$L_i = \begin{cases} \ell_0 & \text{if } i = 0 \\ \ell_i + L_{i-1} & \text{if } i = 1, \ldots, n \end{cases}$$
7 δi,j is equal to 1 if i = j and 0 otherwise.


The weighting of each return in the estimator (4) is represented in Figure 1. It forms a
triangle, and the biggest weighting is given at the horizon of the smallest moving average.
Therefore, depending on the horizon n2 of the shortest moving average, the indicator can
be focused toward the current trend (if n2 is small) or toward past trends (if n2 is as large
as n1 /2 for instance). From these weightings, in the case of a constant trend μ, we can
compute the expectation of the difference between the two moving averages:
$$E\left[\hat{y}_t^{n_2} - \hat{y}_t^{n_1}\right] = \frac{n_1 - n_2}{2} \left(\mu - \frac{1}{2}\sigma_t^2\right) \Delta$$

Therefore, the scaling factor in formula (4) appears naturally.
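An illustrative implementation of the crossover estimator (4), with Δ = 1 and our own alignment conventions (this is a sketch, not the paper's code), is the following.

```python
import numpy as np

def crossover_trend(y, n1=100, n2=20):
    """Estimate mu_hat_t = 2 (y_hat^{n2} - y_hat^{n1}) / (n1 - n2) from log-prices y."""
    y = np.asarray(y, dtype=float)
    ma = lambda n: np.convolve(y, np.ones(n) / n, mode="valid")
    m1, m2 = ma(n1), ma(n2)
    m2 = m2[len(m2) - len(m1):]           # align the short moving average on the long one
    return 2.0 * (m2 - m1) / (n1 - n2)
```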

Figure 1: Window function Li of moving average crossovers (n1 = 100)

Enhanced filters To improve the uniform moving average estimator, we may take the
following kernel function:
$$\ell_i = \frac{4}{n^2} \, \mathrm{sgn}\left(\frac{n}{2} - i\right)$$
We notice that the estimator μ̂t now takes into account all the dates of the window period.
By taking the primitive of the function ℓi, the trend filter is given as follows:
$$L_i = \frac{4}{n^2} \left(\frac{n}{2} - \left|i - \frac{n}{2}\right|\right)$$
We now move to the second type of moving average filter which is characterised by an
asymmetric form of the convolution kernel. One possibility is to take an asymmetric window
function with a triangular form:
$$L_i = \frac{2}{n^2} \left(n - i\right) 1\{i < n\}$$

By computing the derivative of this window function, we obtain the following kernel:
$$\ell_i = \frac{2}{n} \left(\delta_{i,0} - \frac{1}{n} 1\{i < n\}\right)$$
The filtering equation of μt then becomes:
$$\hat{\mu}_t = \frac{2}{n} \left(x_t - \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i}\right)$$

Remark 4 Another way to define μ̂t is to consider the Lanczos generalised derivative
(Groetsch, 1998). Let f (x) be a function. We define the Lanczos derivative of f (x) in
terms of the following relationship:
$$\frac{\mathrm{d}_L}{\mathrm{d}x} f(x) = \lim_{\varepsilon \to 0} \frac{3}{2\varepsilon^3} \int_{-\varepsilon}^{\varepsilon} t \, f(x + t) \, \mathrm{d}t$$
In the discrete case, we have:
$$\frac{\mathrm{d}_L}{\mathrm{d}x} f(x) = \lim_{h \to 0} \frac{\sum_{k=-n}^{n} k \, f(x + kh)}{2 h \sum_{k=1}^{n} k^2}$$

We first notice that the Lanczos derivative is more general than the traditional derivative.
Although Lanczos’ formula is a more onerous method for finding the derivative, it offers
some advantages. This technique allows us to compute a “pseudo-derivative” at points where
the function is not differentiable. For the observable signal yt , the traditional derivative does
not exist because of the noise εt , but does in the case of the Lanczos derivative. Let us apply
Lanczos' formula to estimate the derivative of the trend at the point t − T/2. We obtain:
$$\frac{\mathrm{d}_L}{\mathrm{d}t} \hat{x}_t = \frac{12}{n^3} \sum_{i=0}^{n} \left(\frac{n}{2} - i\right) y_{t-i}$$
We deduce that the kernel is:
$$\ell_i = \frac{12}{n^3} \left(\frac{n}{2} - i\right) 1\{0 \leq i \leq n\}$$
By computing an integration by parts, we obtain the trend filter:
$$L_i = \frac{6}{n^3} \, i \left(n - i\right) 1\{0 \leq i \leq n\}$$
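For reference, the window functions discussed in this paragraph can be generated as follows; the indexing conventions are our own assumptions and the snippet is only a sketch.

```python
import numpy as np

def uniform_kernel(n):
    return np.ones(n) / n                  # L_i = 1/n

def asymmetric_kernel(n):
    i = np.arange(n)
    return 2.0 * (n - i) / n ** 2          # L_i = 2 (n - i) / n^2

def lanczos_kernel(n):
    i = np.arange(n + 1)
    return 6.0 * i * (n - i) / n ** 3      # L_i = 6 i (n - i) / n^3
```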
In Figure 2, we have represented the different functions Li given in this paragraph. We
may extend these filters by computing the convolution of two or more filters. For example,
the mixed filter in Figure 2 is the convolution of the asymmetric filter with the Lanczos
filter. Let us apply these filters to the S&P 500 index. The results are given in Figure 3
for two values of the window length (n = 65 days and n = 260 days). We notice that the
choice of n has a big impact on the filtered series. The choice of the window function seems
to be less important at first sight. However, we should mention that traders are principally
interested in the derivative of the trend, and not the absolute value of the trend itself. In
this case, the window function may have a significant impact. Figure 4 is the scatterplot of
the μ̂t statistic in the case of the S&P 500 index from January 2000 to July 2011 (we have
considered the uniform and Lanczos filters using n = 260). We may also show that this
impact increases when we reduce the length of the window as illustrated in Table 1.


Figure 2: Window function Li of moving average filters (n = 100)

Figure 3: Trend estimate for the S&P 500 index

Table 1: Correlation between the uniform and Lanczos derivatives
n 5 10 22 65 130 260
Pearson ρ 84.67 87.86 90.14 90.52 92.57 94.03
Kendall τ 65.69 68.92 70.94 71.63 73.63 76.17
Spearman 83.15 86.09 88.17 88.92 90.18 92.19

Figure 4: Comparison of the derivative of the trend

2.2.4 Least squares filters


L2 filtering The previous Lanczos filter may be viewed as a local linear regression (Burch
et al., 2005). More generally, least squares methods are often used to define trend estimators:
$$\{\hat{x}_1, \ldots, \hat{x}_n\} = \arg\min \frac{1}{2} \sum_{t=1}^{n} \left(y_t - \hat{x}_t\right)^2$$

However, this problem is not well-defined. We also need to impose some restrictions on the
underlying process yt or on the filtered trend x̂t to obtain a solution. For example, we may
consider a deterministic constant trend:

xt = xt−1 + μ

In this case, we have:


yt = μt + εt (5)
Estimating the filtered trend x̂t is also equivalent to estimating the coefficient μ:
$$\hat{\mu} = \frac{\sum_{t=1}^{n} t \, y_t}{\sum_{t=1}^{n} t^2}$$


If we consider a trend that is not constant, we may define the following objective function:
$$\frac{1}{2} \sum_{t=1}^{n} \left(y_t - \hat{x}_t\right)^2 + \lambda \sum_{t=2}^{n-1} \left(\hat{x}_{t-1} - 2\hat{x}_t + \hat{x}_{t+1}\right)^2$$

In this function, λ is the regularisation parameter which controls the competition between
the smoothness8 of x̂t and the noise yt − x̂t . We may rewrite the objective function in the
vectorial form:
$$\frac{1}{2} \left\|y - \hat{x}\right\|_2^2 + \lambda \left\|D\hat{x}\right\|_2^2$$
where y = (y1 , . . . , yn ), x̂ = (x̂1 , . . . , x̂n ) and the D operator is the (n − 2) × n matrix:
$$D = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{bmatrix}$$

The estimator is then given by the following solution:
$$\hat{x} = \left(I + 2\lambda D^T D\right)^{-1} y$$

It is known as the Hodrick-Prescott filter (or L2 filter). This filter plays an important role
in calibrating the business cycle.
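A direct implementation of this solution is straightforward; the sketch below (not the paper's code) builds the second-difference operator D explicitly and solves the linear system.

```python
import numpy as np

def hp_filter(y, lam=1600.0):
    """Hodrick-Prescott (L2) filter: x_hat = (I + 2 lambda D'D)^{-1} y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]   # second difference x_t - 2 x_{t+1} + x_{t+2}
    return np.linalg.solve(np.eye(n) + 2.0 * lam * D.T @ D, y)
```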

Kalman filtering Another important trend estimation technique is the Kalman filter,
which is described in Appendix A.1. In this case, the trend μt is a hidden process which
follows a given dynamic. For example, we may assume that the model is9 :

$$\begin{cases} R_t = \mu_t + \sigma_\zeta \zeta_t \\ \mu_t = \mu_{t-1} + \sigma_\eta \eta_t \end{cases} \qquad (6)$$

Here, the equation of Rt is the measurement equation and Rt is the observable signal of
realised returns. The hidden process μt is supposed to follow a random walk. We define
$\hat{\mu}_{t|t-1} = E_{t-1}[\mu_t]$ and $P_{t|t-1} = E_{t-1}\left[\left(\hat{\mu}_{t|t-1} - \mu_t\right)^2\right]$. Using the results given in Appendix
A.1, we have:
$$\hat{\mu}_{t+1|t} = \left(1 - K_t\right) \hat{\mu}_{t|t-1} + K_t R_t$$
where $K_t = P_{t|t-1} / \left(P_{t|t-1} + \sigma_\zeta^2\right)$ is the Kalman gain. The estimation error is determined
by Riccati's equation:
$$P_{t+1|t} = P_{t|t-1} + \sigma_\eta^2 - P_{t|t-1} K_t$$
Riccati's equation gives us the stationary solution:
$$P^* = \frac{\sigma_\eta}{2}\left(\sigma_\eta + \sqrt{\sigma_\eta^2 + 4\sigma_\zeta^2}\right)$$
The filter equation becomes:

μ̂t+1|t = (1 − κ) μ̂t|t−1 + κRt


8 We notice that the second term is the discrete derivative of the trend x̂t, which characterises the smoothness of the curve.
9 Equation (5) is a special case of this model if ση = 0.

with:
$$\kappa = \frac{2\sigma_\eta}{\sigma_\eta + \sqrt{\sigma_\eta^2 + 4\sigma_\zeta^2}}$$

This Kalman filter can be considered as an exponential moving average filter with parame-
ter10 λ = − ln(1 − κ):
$$\hat{\mu}_t = \left(1 - e^{-\lambda}\right) \sum_{i=0}^{\infty} e^{-\lambda i} R_{t-i}$$
with11 μ̂t = Et[μt]. The filter of the trend x̂t is therefore determined by the following
equation:
$$\hat{x}_t = \left(1 - e^{-\lambda}\right) \sum_{i=0}^{\infty} e^{-\lambda i} y_{t-i}$$
while the derivative of the trend may be directly related to the observed signal yt as follows:
$$\hat{\mu}_t = \left(1 - e^{-\lambda}\right) y_t - \left(1 - e^{-\lambda}\right)\left(e^{\lambda} - 1\right) \sum_{i=1}^{\infty} e^{-\lambda i} y_{t-i}$$

In Figure 5, we reported the window function of the Kalman filter for several values of λ.
We notice that the cumulative weightings increase strongly with λ. The half-life of this filter
is approximately equal to $(\lambda^{-1} - 2^{-1}) \ln 2$. For example, the half-life for λ = 5% is 14
days.
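The steady-state filter can be sketched in a few lines (an illustration with Δ = 1 and a zero initialisation of our choosing, not the paper's implementation): given the two volatilities, it computes κ and runs the exponential moving average recursion on the returns.

```python
import numpy as np

def ema_trend(R, sigma_eta, sigma_zeta):
    """Steady-state Kalman filter of the slope: mu_hat_{t|t-1} from realised returns R."""
    kappa = 2 * sigma_eta / (sigma_eta + np.sqrt(sigma_eta ** 2 + 4 * sigma_zeta ** 2))
    mu_hat = np.zeros(len(R))
    for t in range(1, len(R)):
        mu_hat[t] = (1 - kappa) * mu_hat[t - 1] + kappa * R[t - 1]
    return mu_hat
```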

Figure 5: Window function Li of the Kalman filter

10 We have 0 < κ < 1 and λ > 0.


11 We notice that μ̂t+1|t = μ̂t .


We may wonder what the link is between the regression model (5) and the Markov model
(6). Equation (5) is equivalent to the following state space model12:
$$\begin{cases} y_t = x_t + \sigma_\varepsilon \varepsilon_t \\ x_t = x_{t-1} + \mu \end{cases}$$
If we now consider that the trend is stochastic, the model becomes:
$$\begin{cases} y_t = x_t + \sigma_\varepsilon \varepsilon_t \\ x_t = x_{t-1} + \mu + \sigma_\zeta \zeta_t \end{cases}$$
This model is called the local level model. We may also assume that the slope of the trend
is stochastic, in which case we obtain the local linear trend model:
$$\begin{cases} y_t = x_t + \sigma_\varepsilon \varepsilon_t \\ x_t = x_{t-1} + \mu_{t-1} + \sigma_\zeta \zeta_t \\ \mu_t = \mu_{t-1} + \sigma_\eta \eta_t \end{cases}$$

These three models are special cases of structural models (Harvey, 1989) and may be easily
solved by Kalman filtering. We also deduce that the Markov model (6) is a special case of
the latter when σε = 0.

Remark 5 We have shown that Kalman filtering may be viewed as an exponential moving
average filter when we consider the Markov model (6). Nevertheless, we cannot regard the
Kalman filter simply as a moving average filter. First, the Kalman filter is the optimal
filter in the case of the linear Gaussian model described in Appendix A.1. Second, it could
be regarded as “an efficient computational solution of the least squares method” (Sorensen,
1970). Third, we could use it to solve more sophisticated processes than the Markov model
(6). However, some nonlinear or non Gaussian models may be too complex for Kalman
filtering. These nonlinear models can be solved by particle filters or sequential Monte Carlo
methods (see Doucet et al., 1998).

Another important feature of the Kalman approach is the derivation of an optimal


smoother (see Appendix A.1). At time t, we are interested in the numerical value of xt, but
also by the past values of xt−i because we would like to measure the slope of the trend. The
Kalman smoother improves the estimate of x̂t−i by using all the information between t − i
and t. Let us consider the previous example in relation to the S&P 500 index, using the local
level model. Figure 6 gives the filtered and smoothed components xt and μt for two sets
of parameters13 . We verify that the Kalman smoother reduces the noise by incorporating
more information. We also notice that the restriction σε = 0 increases the variance of the
trend and slope estimators.

2.3 Nonlinear filtering


In this section, we review other filtering approaches. They are generally classed as nonlinear
filters, because it is not possible to express the trend as a linear convolution of the signal
and a window function.
12 In what follows, the noise processes are white noise: εt ∼ N(0, 1), ζt ∼ N(0, 1) and ηt ∼ N(0, 1).
13 For the first set of parameters, we assume that σε = 100σζ and ση = σζ/100. For the second set of
parameters, we impose the restriction σε = 0.

Figure 6: Kalman filtered and smoothed components

2.3.1 Nonparametric regression


In the regression model (5), we assume that xt = f (t) while f (t) = μt. The model is said to
be parametric because the estimation of the trend consists of estimating the parameter μ.
We then have x̂t = μ̂t. With nonparametric regression, we directly estimate the function f ,
obtaining x̂t = fˆ (t). Some examples of nonparametric regression are kernel regression, loess
regression and spline regression. A popular method for trend filtering is local polynomial
regression:

$$y_t = f(t) + \varepsilon_t = \beta_0(\tau) + \sum_{j=1}^{p} \beta_j(\tau) \left(\tau - t\right)^j + \varepsilon_t$$
For a given value of τ, we estimate the parameters β̂j(τ) using weighted least squares with
the following weightings:
$$w_t = K\left(\frac{\tau - t}{h}\right)$$
where K is the kernel function with a bandwidth h. We deduce that:

x̂t = E [ yt | τ = t] = β̂0 (t)

Cleveland (1979) proposed an improvement to the kernel regression through a two-stage


procedure (loess regression). First, we fit a polynomial regression to estimate the residuals
ε̂t. Then, we compute δt = (1 − ut²)² · 1{|ut| ≤ 1} with ut = ε̂t / (6 median(|ε̂|)) and run a
second kernel regression14 with weightings δt wt.
14 Cleveland (1979) suggests using the tricube kernel function to define K.


A spline function is a C 2 function S (τ ) which corresponds to a cubic polynomial function


on each interval [t, t + 1[. Let SP be the set of spline functions. We then have to solve the
following optimisation programme:
$$\min_{S \in \mathcal{SP}} \; (1 - h) \sum_{t=0}^{n} w_t \left(y_t - S(t)\right)^2 + h \int_{0}^{T} w_\tau \, S''(\tau)^2 \, \mathrm{d}\tau$$

where h is the smoothing parameter – h = 0 corresponds to the interpolation case15 and


h = 1 corresponds to the linear regression16 .

Figure 7: Illustration of the kernel, loess and spline filters

We illustrate these three nonparametric methods in Figure 7. The calibration of these


filters is more complicated than for moving average filters, where the only parameter is the
length n of the window. With these methods, we have to decide the polynomial degree17 p,
the kernel function18 K and the smoothing parameter19 h.

2.3.2 L1 filtering
The idea of the Hodrick-Prescott filter can be generalised to a larger class of filters by using
the Lp penalty condition instead of the L2 penalty. This generalisation was previously
15 We have x̂t = S(t) = yt.
16 We have x̂t = S(t) = ĉ + μ̂t with (ĉ, μ̂) the OLS estimate of yt on a constant and time t, because the
optimum is reached for S″(τ) = 0.
17 For the kernel regression, we use a Gaussian kernel with a bandwidth h = 0.10. We notice the impact
of the degree of the polynomial. The higher the degree, the smoother the trend (and the slope of the trend).
18 For the loess regression, the degree of the polynomial is set to 1 and the bandwidth h is 0.02. We show the
impact of the second step, which modifies the kernel function.
19 For the spline regression, we consider a uniform kernel function. We notice that the parameter h has an
impact on the smoothness of the trend.

discussed in the work of Daubechies et al. (2004) in relation to the linear inverse problem,
while Tibshirani (1996) considers the Lasso regression problem. If we consider an L1 filter,
the objective function becomes:
$$\frac{1}{2} \sum_{t=1}^{n} \left(y_t - \hat{x}_t\right)^2 + \lambda \sum_{t=2}^{n-1} \left|\hat{x}_{t-1} - 2\hat{x}_t + \hat{x}_{t+1}\right|$$
which is equivalent to the following vectorial form:
$$\frac{1}{2} \left\|y - \hat{x}\right\|_2^2 + \lambda \left\|D\hat{x}\right\|_1$$
Kim et al. (2009) show that the dual problem of this L1 filter scheme is a quadratic
programme with some boundary constraints20 . To find x̂, we may also use the quadratic
programming algorithm, but Kim et al. (2009) suggest using the primal-dual interior point
method in order to optimise the numerical computation speed.

We have illustrated the L1 filter in Figure 8. Contrary to all other previous methods, the
filtered signal comprises a set of straight trends and breaks21 , because the L1 norm imposes
the condition that the second derivative of the filtered signal must be zero. The competition
between the two terms in the objective function turns to the competition between the number
of straight trends (or the number of breaks) and the closeness to the data. Thus, the
smoothing parameter λ plays an important role for detecting the number of breaks. This
explains why L1 filtering is radically different to L2 (or Hodrick-Prescott) filtering. Moreover,
it is easy to compute the slope of the trend μ̂t for the L1 filter. It is a step function, indicating
clearly if the trend is up or down, and when it changes (see Figure 8).
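As an illustration, the L1 filtering problem can be handed to a generic convex solver. The sketch below uses cvxpy (our own choice of tooling; Kim et al. (2009) rely on a specialised primal-dual interior point method for speed).

```python
import numpy as np
import cvxpy as cp

def l1_filter(y, lam=7.0):
    """L1 trend filtering: min 0.5 ||y - x||_2^2 + lambda ||D x||_1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for t in range(n - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))).solve()
    return x.value
```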

2.3.3 Wavelet filtering


Another way to estimate the trend xt is to denoise the signal yt by using spectral analy-
sis. The Fourier transform is an alternative representation of the original signal yt , which
becomes a frequency function:
$$y(\omega) = \sum_{t=1}^{n} y_t \, e^{-i\omega t}$$

We note y (ω) = F (y). By construction, we have y = F −1 (y) with F −1 the inverse Fourier
transform. A simple idea for denoising in spectral analysis is to set some coefficients y (ω)
to zero before reconstructing the signal. Figure 9 is an illustration of denoising using the
thresholding rule. Selected parts of the frequency spectrum can easily be manipulated by
filtering tools. For example, some can be attenuated, and others may be completely removed.
Applying the inverse Fourier transform to this filtered spectrum leads to a filtered time series.
Therefore, a smoothing signal can be easily performed by applying a low-pass filter, that is,
by removing the higher frequencies. For example, we have represented two denoised signals
of the S&P 500 index in Figure 9. For the first one, we use a 95% thresholding procedure
whereas 99% of the Fourier coefficients are set to zero in the second case. One difficulty
with this approach is the bad time location for low frequency signals and the bad frequency
location for the high frequency signals. It is then difficult to localise when the trend (which
is located in low frequencies) reverses. But the main drawback of spectral analysis is that
it is not well suited to nonstationary processes (Martin and Flandrin, 1985, Fuentes, 2002,
Oppenheim and Schafer, 2009).

20 The detail of this derivation is shown in Appendix A.2.


21 A break is the position where the signal trend changes.


Figure 8: L1 versus L2 filtering

Figure 9: Spectral filtering

A solution consists of adopting a double dimension analysis, both in time and frequency.
This approach corresponds to the wavelet analysis. The method of denoising is the same as
described previously and the estimation of xt is done in three steps:

1. we compute the wavelet transform W of the original signal yt to obtain the wavelet
coefficients ω = W (y);

2. we modify the wavelet coefficients according to a denoising rule D:
$$\omega' = D(\omega)$$

3. we convert the modified wavelet coefficients into a new signal using the inverse wavelet
transform W⁻¹:
$$x = W^{-1}(\omega')$$

There are two principal choices in this approach. First, we have to specify which mother
wavelet to use. Second, we have to define the denoising rule. Let ω − and ω + be two scalars
with 0 < ω − < ω + . Donoho and Johnstone (1995) define several shrinkage methods22 :

• Hard shrinkage:
$$\omega_i' = \omega_i \cdot 1\left\{|\omega_i| > \omega^+\right\}$$

• Soft shrinkage:
$$\omega_i' = \mathrm{sgn}(\omega_i) \cdot \left(|\omega_i| - \omega^+\right)_+$$

• Semi-soft shrinkage:
$$\omega_i' = \begin{cases} 0 & \text{if } |\omega_i| \leq \omega^- \\ \mathrm{sgn}(\omega_i) \left(\omega^+ - \omega^-\right)^{-1} \omega^+ \left(|\omega_i| - \omega^-\right) & \text{if } \omega^- < |\omega_i| \leq \omega^+ \\ \omega_i & \text{if } |\omega_i| > \omega^+ \end{cases}$$

• Quantile shrinkage is a hard shrinkage method where ω⁺ is the q-th quantile of the
coefficients |ωi|.

Wavelet filtering is illustrated in Figure 10. We have computed the wavelet coefficients
using the cascade algorithm of Mallat (1989) and the low-pass and high-pass filters of order
6 proposed by Daubechies (1992). The filtered trend is obtained using quantile shrinkage.
In the first case, the noisy signal remains because we consider all the coefficients (q = 0). In
the second and third cases, 95% and 99% of the wavelet coefficients are set to zero23 .
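The three-step procedure can be sketched with the PyWavelets library (our choice of tooling, not the paper's own code); the example below applies a Daubechies decomposition followed by quantile (hard) shrinkage of the detail coefficients.

```python
import numpy as np
import pywt

def wavelet_trend(y, q=0.95, wavelet="db6"):
    coeffs = pywt.wavedec(np.asarray(y, dtype=float), wavelet)       # step 1: omega = W(y)
    detail = np.concatenate([np.abs(c) for c in coeffs[1:]])
    thresh = np.quantile(detail, q)                                  # step 2: denoising rule
    coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="hard") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(y)]                    # step 3: x = W^{-1}(omega')
```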

2.3.4 Other methods


Many other methods can be used to perform trend filtering. The most recent include, for
example, singular spectrum analysis24 (Vautard et al., 1992), support vector machines25
and empirical mode decomposition (Flandrin et al., 2004). Moreover, we notice that traders
sometimes use their own techniques (see, inter alia, Ehlers, 2001).
22 In practice, the coefficients ωi are standardised before being computed.
23 It is interesting to note that the denoising procedure retains some wavelet coefficients corresponding to
high and medium frequencies and located around the 2008 crisis.
24 See Appendix A.5 for an illustration.
25 A brief presentation is given in Appendix A.4.


Figure 10: Wavelet filtering

2.4 Multivariate filtering

Until now, we have assumed that the trend is specific to a financial asset. However, we may
be interested in estimating the common trend of several financial assets. For example, if we
wanted to estimate the trend of emerging markets equities, we could use a global index like
the MSCI EM or extract the trend by considering several indices, e.g. the Bovespa index
(Brazil), the RTS index (Russia), the Nifty index (India), the HSCEI index (China), etc. In
this case, the trend-cycle model becomes:

$$\begin{pmatrix} y_t^{(1)} \\ \vdots \\ y_t^{(m)} \end{pmatrix} = x_t + \begin{pmatrix} \varepsilon_t^{(1)} \\ \vdots \\ \varepsilon_t^{(m)} \end{pmatrix}$$
where $y_t^{(j)}$ and $\varepsilon_t^{(j)}$ are respectively the signal and the noise of the financial asset j and xt
is the common trend. One idea for estimating the common trend is to obtain the mean of
the specific trends:
$$\hat{x}_t = \frac{1}{m} \sum_{j=1}^{m} \hat{x}_t^{(j)}$$

If we consider moving average filtering, it is equivalent to applying the filter to the average
signal26 $\bar{y}_t = \frac{1}{m} \sum_{j=1}^{m} y_t^{(j)}$. This rule is also valid for some nonlinear filters such as L1 filtering
(see Appendix A.2). In what follows, we consider the two main alternative approaches
developed in econometrics to estimate a (stochastic) common trend.

2.4.1 Error-correction model, common factors and the P-T decomposition

The econometrics of nonstationary time series may also help us to estimate a common trend.
$y_t^{(j)}$ is said to be integrated of order 1 if the change $y_t^{(j)} - y_{t-1}^{(j)}$ is stationary. We will note
$y_t^{(j)} \sim I(1)$ and $(1 - \mathrm{L})\, y_t^{(j)} \sim I(0)$. Let us now define $y_t = \left(y_t^{(1)}, \ldots, y_t^{(m)}\right)$. The vector yt
is cointegrated of rank r if there exists a matrix β of rank r such that $z_t = \beta^T y_t \sim I(0)$.
In this case, we show that yt may be specified by an error-correction model (Engle and
Granger, 1987):
$$\Delta y_t = \gamma z_{t-1} + \sum_{i} \Phi_i \, \Delta y_{t-i} + \zeta_t \qquad (7)$$

where ζt is a I (0) vector process. Stock and Watson (1988) propose another interesting
representation of cointegration systems. Let ft be a vector of r common factors which are
I (1). Therefore, we have:
yt = Aft + ηt (8)

where ηt is a I (0) vector process and ft is a I (1) vector process. One of the difficulties with
this type of model is the identification step (Peña and Box, 1987). Gonzalo and Granger
(1995) suggest defining a permanent-transitory (P-T) decomposition:

$$y_t = P_t + T_t$$

such that the permanent component Pt is difference stationary, the transitory component Tt
is covariance stationary and (ΔPt , Tt ) satisfies a constrained autoregressive representation.
Using this framework and some other conditions, Gonzalo and Granger show that we may
obtain the representation (8) by estimating the relationship (7):

$$f_t = \breve{\gamma}^T y_t \qquad (9)$$

where $\breve{\gamma}^T \gamma = 0$. They then follow the works of Johansen (1988, 1991) to derive the maximum
likelihood estimator of γ̆. Once we have estimated the relationship (9), it is also easy to
identify the common trend27 x̂t .
26 We have:
$$\hat{x}_t = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=0}^{n-1} L_i \, y_{t-i}^{(j)} = \sum_{i=0}^{n-1} L_i \left(\frac{1}{m} \sum_{j=1}^{m} y_{t-i}^{(j)}\right) = \sum_{i=0}^{n-1} L_i \, \bar{y}_{t-i}$$

27 If a common trend exists, it is necessarily one of the common factors.


2.4.2 Common stochastic trend model


Another idea is to consider an extension of the local linear trend model:

$$\begin{cases} y_t = \alpha x_t + \varepsilon_t \\ x_t = x_{t-1} + \mu_{t-1} + \sigma_\zeta \zeta_t \\ \mu_t = \mu_{t-1} + \sigma_\eta \eta_t \end{cases}$$
with $y_t = \left(y_t^{(1)}, \ldots, y_t^{(m)}\right)$, $\varepsilon_t = \left(\varepsilon_t^{(1)}, \ldots, \varepsilon_t^{(m)}\right) \sim N(0, \Omega)$, $\zeta_t \sim N(0, 1)$ and $\eta_t \sim N(0, 1)$.
Moreover, we assume that εt , ζt and ηt are independent of each other. Given the parameters
(α, Ω, σζ , ση ), we may run the Kalman filter to estimate the trend xt and the slope μt whereas
the Kalman smoother allows us to estimate xt−i and μt−i at time t.

Remark 6 The case ση = 0 has been extensively studied by Chang et al. (2009). In
particular, they show that yt is cointegrated with $\beta = \Omega^{-1}\Gamma$ and Γ an m × (m − 1) matrix
such that $\Gamma^T \Omega^{-1} \alpha = 0$ and $\Gamma^T \Omega^{-1} \Gamma = I_{m-1}$. Using the P-T decomposition, they also found
that the common stochastic trend is given by $\alpha^T \Omega^{-1} y_t$, implying that the above averaging
rule is not optimal.

We come back to the example given in Figure 6 page 22. Using the second set of
parameters, we now consider three stock indices: the S&P 500 index, the Stoxx 600 index
and the MSCI EM index. For each index, we estimate the filtered trend. Moreover, using the
previous common stochastic trend model28 , we estimate the common trend for the bivariate
signal (S&P 500, Stoxx 600) and the trivariate signal (S&P 500, Stoxx 600, MSCI EM).

Figure 11: Multivariate Kalman filtering

28 We assume that αj takes the value 1 for the three signals.

3 Trend filtering in practice
3.1 The calibration problem
For the practical use of the trend extraction techniques discussed above, the calibration of
filtering parameters is crucial. These calibrated parameters must incorporate our prediction
requirement or they can be mapped to a commonly-known benchmark estimator. These
constraints offer us some criteria for determining the optimal parameters for our expected
prediction horizon. Below, we consider two possible calibration schemes based on these
criteria.

3.1.1 Calibration based on prediction error


One idea for estimating the parameters of a model is to use statistical inference tools. Let
us consider the local linear trend model. We may estimate the set of parameters (σε , σζ , ση )
by maximising the log-likelihood function29 :
$$\ell = -\frac{1}{2} \sum_{t=1}^{n} \left(\ln 2\pi + \ln F_t + \frac{v_t^2}{F_t}\right)$$
where $v_t = y_t - E_{t-1}[y_t]$ is the innovation process and $F_t = E_{t-1}\left[v_t^2\right]$ is the variance of vt.
In Figure 12, we have reported the filtered and smoothed trend and slope estimated by the
maximum likelihood method. We notice that the estimated components are more noisy than
those obtained in Figure 6. We can explain this easily because maximum likelihood is based
on the one-day innovation process. If we want to look at a longer trend, we have to consider
the innovation process vt = yt − Et−h [yt ] where h is the horizon time. We have reported
the slope for h = 50 days in Figure 12. It is very different from the slope corresponding to
h = 1 day.

The problem is that the computation of the log-likelihood for the innovation process
vt = yt − Et−h [yt ] is trickier because there is generally no analytic expression. This is
why we do not recommend this technology for trend filtering problems, because the trends
estimated are generally very short-term. A better solution is to employ a cross-validation
procedure to calibrate the parameters θ of the filters discussed above. Let us consider the
calibration scheme presented in Figure 13. We divide our historical data into a training set
and a validation set, which are characterised by two time parameters T1 and T2 . The size
of training set T1 controls the precision of our calibration, for a fixed parameter θ. For this
training set, the value of the expectation of Et−h [yt ] is computed. The second parameter
29 Another way of estimating the parameters is to consider the log-likelihood function in the frequency
domain analysis (Roncalli, 2010). In the case of the local linear trend model, the stationary form of yt is
S(yt) = (1 − L)² yt. We deduce that the associated log-likelihood function is:
$$\ell = -\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{j=0}^{n-1} \ln f(\lambda_j) - \frac{1}{2} \sum_{j=0}^{n-1} \frac{I(\lambda_j)}{f(\lambda_j)}$$
where I(λj) is the periodogram of S(yt) and f(λ) is the spectral density:
$$f(\lambda) = \sigma_\eta^2 + 2\left(1 - \cos \lambda\right) \sigma_\zeta^2 + 4\left(1 - \cos \lambda\right)^2 \sigma_\varepsilon^2$$
because we have:
$$S(y_t) = \sigma_\eta \eta_{t-1} + \sigma_\zeta \left(1 - \mathrm{L}\right) \zeta_t + \sigma_\varepsilon \left(1 - \mathrm{L}\right)^2 \varepsilon_t$$


Figure 12: Maximum likelihood of the trend and slope components

T2 determines the size of the validation set, which is used to estimate the prediction error:
$$e(\theta; h) = \sum_{t=1}^{n-h} \left(y_t - E_{t-h}[y_t]\right)^2$$

This quantity is directly related to the prediction horizon h = T2 for a given investment
strategy. The minimisation of the prediction error leads to the optimal value θ⋆ of the filter
parameters which will be used to predict the trend for the test set. For example, we apply
this calibration scheme for L1 filtering for h equal to 50 days. Figure 14 illustrates the
calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50. Minimising the
cumulative prediction error over the validation set gives the optimal value λ⋆ = 7.03.
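Schematically, the procedure of Figure 13 can be written as the following loop. It is a simplified sketch: the trend-based forecast over the validation set and the generic fit_trend argument (for instance the l1_filter sketch given earlier) are our own assumptions, not the paper's implementation.

```python
import numpy as np

def calibrate_parameter(y, lambdas, fit_trend, T1=400, T2=50):
    """Pick the smoothing parameter minimising the prediction error over the validation set."""
    y = np.asarray(y, dtype=float)
    train, valid = y[:T1], y[T1:T1 + T2]
    errors = []
    for lam in lambdas:
        x_hat = fit_trend(train, lam)                    # trend fitted on the training set
        mu_hat = x_hat[-1] - x_hat[-2]                   # last estimated slope
        forecast = x_hat[-1] + mu_hat * np.arange(1, T2 + 1)
        errors.append(np.sum((valid - forecast) ** 2))   # prediction error e(theta; h)
    return lambdas[int(np.argmin(errors))]
```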

Figure 13: Cross-validation procedure for determining the optimal parameters θ⋆ (schematic:
the historical data up to today are split into a training set of length T1 and a validation set of
length T2; the forecast is then produced over the following period of length T2)

3.1.2 Calibration based on benchmark estimator


The trend filtering algorithm can be calibrated with a benchmark estimator. In order to
illustrate this idea, we present in this discussion the calibration procedure for L2 filters by

Figure 14: Calibration procedure with the S&P 500 index for the L1 filter

using spectral analysis. Though the L2 filter provides an explicit solution which is a great
advantage for numerical implementation, the calibration of the smoothing parameter λ is
not straightforward. We propose to calibrate the L2 filter by comparing the spectral density
of this filter with that obtained using the uniform moving average filter with horizon n for
which the spectral density is:
$$f^{MA}(\omega) = \frac{1}{n^2} \left|\sum_{t=0}^{n-1} e^{-i\omega t}\right|^2$$
For the L2 filter, the solution has the analytical form $\hat{x} = \left(I + 2\lambda D^T D\right)^{-1} y$. Therefore, the
spectral density can also be computed explicitly:
$$f^{HP}(\omega) = \left(\frac{1}{1 + 4\lambda\left(3 - 4\cos\omega + \cos 2\omega\right)}\right)^2$$
This spectral density can then be approximated by $\left(1 + 2\lambda\omega^4\right)^{-2}$. Hence, the spectral
width is $(2\lambda)^{-1/4}$ for the L2 filter whereas it is $2\pi n^{-1}$ for the uniform moving average filter.
The calibration of the L2 filter could be achieved by matching these two quantities. Finally,
we obtain the following relationship:

$$\lambda \propto \lambda^\star = \frac{1}{2} \left(\frac{n}{2\pi}\right)^4$$
In Figure 15, we represent the spectral density of the uniform moving average filter for
different window sizes n. We also report the spectral density of the corresponding L2 filters.
To obtain this, we calibrated the optimal parameter λ by least square minimisation. In


Figure 16, we compare the optimal estimator λ with that corresponding to 10.27 × λ⋆. We
notice that the approximation is very good30 .
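In code, the benchmark-based calibration therefore reduces to a one-line mapping from the moving average length n to the L2 smoothing parameter, the constant 10.27 being the least-squares fit reported above (a sketch, not the paper's code).

```python
import math

def l2_lambda_from_ma(n):
    """Map a uniform moving average of length n to an equivalent L2 (HP) parameter lambda."""
    return 10.27 * 0.5 * (n / (2 * math.pi)) ** 4
```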

Figure 15: Spectral density of moving average and L2 filters

3.2 What about the variance of the estimator?


Let μ̂t be the estimator of the slope of the trend. There may be some confusion between the
estimator of the slope and the estimated value of the slope (or the estimate). The estimator
is a random variable and is defined by a probability distribution function. Based on the
sample data, the estimator takes a value which is the estimate of the slope. Suppose that
we obtain an estimate of 10%. It means that 10% is the most likely value of the slope given
the data. But it does not mean that 10% is the true value of the slope.

3.2.1 Measuring the efficiency of trend filters


Let μ0t be the true value of the slope. In statistical inference, the quality of an estimator is
defined by the mean squared error (or MSE):

$$\operatorname{MSE}\left(\hat{\mu}_t\right) = \operatorname{E}\left[\left(\hat{\mu}_t - \mu_t^0\right)^2\right]$$

It indicates how far the estimates are from the true value. We say that the estimator $\hat{\mu}_t^{(1)}$
is more efficient than the estimator $\hat{\mu}_t^{(2)}$ if its MSE is lower:

$$\hat{\mu}_t^{(1)} \succeq \hat{\mu}_t^{(2)} \Leftrightarrow \operatorname{MSE}\left(\hat{\mu}_t^{(1)}\right) \leq \operatorname{MSE}\left(\hat{\mu}_t^{(2)}\right)$$

30 We estimated the figure 10.27 using least squares.

Figure 16: Relationship between the value of λ and the length of the moving average filter

We may decompose the MSE statistic into two components:

$$\operatorname{MSE}\left(\hat{\mu}_t\right) = \operatorname{E}\left[\left(\hat{\mu}_t - \operatorname{E}\left[\hat{\mu}_t\right]\right)^2\right] + \left(\operatorname{E}\left[\hat{\mu}_t\right] - \mu_t^0\right)^2$$

The first component is the variance of the estimator var(μ̂t) whereas the second component
is the square of the bias B(μ̂t). Generally, we are interested in estimators that are unbiased
(B(μ̂t) = 0). If this is the case, comparing two estimators is equivalent to comparing their
variances.

Let us assume that the price process is a geometric Brownian motion:

$$\mathrm{d}S_t = \mu_0\,S_t\,\mathrm{d}t + \sigma_0\,S_t\,\mathrm{d}W_t$$

In this case, the slope of the trend is constant and is equal to μ0. In Figure 17, we have
reported the probability density function of the estimator μ̂t when the true slope μ0 is 10%.
We consider the estimator based on a uniform moving average filter of length n. First, we
notice that using filters is better than using the noisy signal. We also observe that the
variance of the estimators increases with the parameter σ0 and decreases with the length n.
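This effect can be reproduced with a short Monte-Carlo experiment. The sketch below simulates the geometric Brownian motion and estimates the slope as the annualised mean of the last n daily log-returns, which is a simple stand-in for the uniform moving average filter of the text; all numerical settings are illustrative.

```python
import numpy as np

# Distribution of a moving-average slope estimator for a GBM with drift mu0
rng = np.random.default_rng(1)
mu0, sigma0, dt = 0.10, 0.20, 1.0 / 260

for n in (20, 65, 260):                     # filter length in days
    estimates = []
    for _ in range(5000):
        r = (mu0 - 0.5 * sigma0 ** 2) * dt + sigma0 * np.sqrt(dt) * rng.normal(size=n)
        estimates.append(r.mean() / dt + 0.5 * sigma0 ** 2)   # annualised drift estimate
    estimates = np.array(estimates)
    print(f"n = {n:4d}   mean = {estimates.mean():+.2%}   std = {estimates.std():.2%}")
```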

3.2.2 Trend detection versus trend filtering


In the previous paragraph, we saw that an estimate of the trend may not be significant if
the variance of the estimator is too large. Before computing an estimate of the trend, we
then have to decide if there is a trend or not. This process is called trend detection. Mann
(1945) considers the following statistic:

$$S_t^{(n)} = \sum_{i=0}^{n-2}\sum_{j=i+1}^{n-1}\operatorname{sgn}\left(y_{t-i} - y_{t-j}\right)$$


Figure 17: Density of the estimator μ̂t

Figure 18: Impact of μ0 on the estimator μ̂t

with sgn(yt−i − yt−j) = 1 if yt−i > yt−j and sgn(yt−i − yt−j) = −1 if yt−i < yt−j. We have31:

$$\operatorname{var}\left(S_t^{(n)}\right) = \frac{n\left(n-1\right)\left(2n+5\right)}{18}$$

We can show that:

$$-\frac{n\left(n+1\right)}{2} \leq S_t^{(n)} \leq \frac{n\left(n+1\right)}{2}$$

The bounds are reached if yt < yt−i (negative trend) or yt > yt−i (positive trend) for i ∈ N∗.
We can then normalise the score:

$$\bar{S}_t^{(n)} = \frac{2\,S_t^{(n)}}{n\left(n+1\right)}$$

$\bar{S}_t^{(n)}$ takes the value +1 (or −1) if we have a perfect positive (or negative) trend. If there is
no trend, it is obvious that $\bar{S}_t^{(n)} \simeq 0$. Under this null hypothesis, we have:

$$Z_t^{(n)} \xrightarrow[n\to\infty]{} \mathcal{N}\left(0,1\right)$$

with:

$$Z_t^{(n)} = \frac{S_t^{(n)}}{\sqrt{\operatorname{var}\left(S_t^{(n)}\right)}}$$

In Figure 19, we reported the normalised score $\bar{S}_t^{(n)}$ for the S&P 500 index and different
values of n. Statistics relating to the null hypothesis are given in Table 2 for the study
period. We notice that we generally reject the hypothesis that there is no trend when we
consider a period of one year. The number of cases when we observe a trend increases if we
consider a shorter period. For example, if n is equal to 10 days, we accept the hypothesis
that there is no trend in 42% of cases when the confidence level α is set to 90%.

Table 2: Frequencies of rejecting the null hypothesis with confidence level α

                    α      90%      95%      99%
    n = 10 days         58.06%   49.47%   29.37%
    n = 3 months        85.77%   82.87%   76.68%
    n = 1 year          97.17%   96.78%   95.33%
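A direct implementation of the statistic and of its normalisation is sketched below. It follows the formulas above, omits the correction for tied sequences given in footnote 31, and assumes the window is passed in chronological order; the toy usage at the end is purely illustrative.

```python
import numpy as np

def mann_kendall(y):
    # S = sum over all pairs of sgn(more recent value - older value)
    y = np.asarray(y, dtype=float)
    n = len(y)
    s = sum(np.sign(y[j] - y[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    score = 2.0 * s / (n * (n + 1))          # normalised score
    z = s / np.sqrt(var_s)                   # z-score under the no-trend hypothesis
    return score, z

rng = np.random.default_rng(2)
print(mann_kendall(np.cumsum(rng.normal(size=250))))   # random walk window
print(mann_kendall(np.arange(250.0)))                  # strongly trending window
```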

Remark 7 We have reported the statistic $S_t^{(10)}$ against the trend estimate32 $\hat{\mu}_t$ for the S&P
500 index since January 2000. We notice that $\hat{\mu}_t$ may be positive whereas $S_t^{(10)}$ is negative.
This illustrates that a trend measurement is just an estimate. It does not mean that a trend
exists.

31 If there are some tied sequences (yt−i = yt−i−1), the formula becomes:

$$\operatorname{var}\left(S_t^{(n)}\right) = \frac{1}{18}\left(n\left(n-1\right)\left(2n+5\right) - \sum_{k=1}^{g} n_k\left(n_k-1\right)\left(2n_k+5\right)\right)$$

with g the number of tied sequences and nk the number of data points in the kth tied sequence.
32 It is computed with a uniform moving average of 10 days.


Figure 19: Trend detection for the S&P 500 index

Figure 20: Trend detection versus trend filtering

3.3 From trend filtering to trend forecasting
There are two possible applications for the trend following problem. First, trend filtering
can analyse the past. A noisy signal can be transformed into a smoother signal, which can be
interpreted more easily. An ex-post analysis of this kind can, for instance, clearly separate
increasing price periods from decreasing price periods. This analysis can be performed on
any time series, or even on a random walk. For example, we have reported four simulations
of a geometric Brownian motion without drift and with an annual volatility of 20% in Figure 21. In
this context, trend filtering could help us to estimate the different trends in the past.

Figure 21: Four simulations of a geometric Brownian motion without drift

On the other hand, trend analysis may be used as a predictive tool. Prediction is a
much more ambitious objective than analysing the past. It cannot be performed on any
time series. For instance, trend following predictions suppose that the last observed trend
influences future returns. More precisely, these predictors suppose that positive (or negative)
trends are more likely to be followed by positive (or negative) returns. Such an assumption
has to be tested empirically. For example, it is obvious that the time series in Figure 21
exhibit certain trends, whereas we know that there is no trend in a geometric Brownian
motion without drift. Thus, we may still observe some trends in an ex-post analysis. It does
not mean, however, that trends will persist in the future.

The persistence of trends is tested here in a simple framework for major financial in-
dices33 . For each of these indices the average one-month returns are separated into two sets.
The first set includes one-month returns that immediately follow a positive three-month
return (this is negative for the second set). The average one-month return is computed for
each of these two sets, and the results are given in Table 3. These results clearly show
33 The study period begins in January 1995 (January 1999 for the MSCI EM) and finishes in October 2011.

that, on average, higher returns can be expected after a positive three-month return than
after a negative three-month period. Therefore, observation of the current trend may have a
predictive value for the indices under consideration. Moreover, we consider the distribution
of the one-month returns, based on past three-month returns. Figure 22 illustrates the case
of the GSCI index. In the first quadrant, the one-month returns are divided into two sets,
depending on whether the previous three-month return is positive or negative. The cumu-
lative distributions of these two sets are shown. In the second quadrant, we consider, on
the one hand, the distribution of one-month returns following a three-month return below
−5% and, on the other hand, the distribution of returns following a three-month return
exceeding +5%. The same procedure is repeated in the other quadrants, for a 10% and a
15% threshold. This simple test illustrates the usefulness of trend following strategies. Here,
trends seem persistent enough to study such strategies. Of course, on other time scales or
for other assets, one may obtain opposite results that would support contrarian strategies.

Figure 22: Distribution of the conditional standardised monthly return

Table 3: Average one-month conditional return based on past trends

    Trend            Positive   Negative   Difference
    Eurostoxx 50        1.1%       0.2%        0.9%
    S&P 500             0.9%       0.5%        0.4%
    MSCI WORLD          0.6%      −0.3%        1.0%
    MSCI EM             1.9%      −0.3%        2.2%
    TOPIX               0.4%      −0.4%        0.9%
    EUR/USD             0.2%      −0.2%        0.4%
    USD/JPY             0.2%      −0.2%        0.4%
    GSCI                1.3%      −0.4%        1.6%
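The conditioning test behind Table 3 is easy to reproduce. The pandas sketch below assumes a Series of daily index levels with a DatetimeIndex and month-end resampling; it illustrates the methodology rather than the exact computation used for the table, and the simulated price series is only there to make the example runnable.

```python
import numpy as np
import pandas as pd

def conditional_monthly_returns(prices):
    monthly = prices.resample("M").last()        # month-end levels
    r1 = monthly.pct_change()                    # one-month return
    r3 = monthly.pct_change(3).shift(1)          # previous three-month return
    pos = r1[r3 > 0].mean()
    neg = r1[r3 < 0].mean()
    return pos, neg, pos - neg

# toy example on a simulated price series
idx = pd.date_range("1995-01-01", periods=4000, freq="B")
rng = np.random.default_rng(7)
prices = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(3e-4, 0.01, 4000))), index=idx)
print(conditional_monthly_returns(prices))
```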

4 Conclusion
The ultimate goal of trend filtering in finance is to design portfolio strategies that may
benefit from these trends. But the path between trend measurement and portfolio allocation
is not straightforward. It involves studies and explanations that would not fit in this paper.
Nevertheless, let us point out some major issues. Of course, the first problem is the selection
of the trend filtering method. This selection may lead to a single procedure or to a pool of
methods. The selection of several methods raises the question of an aggregation procedure.
This can be done through averaging or dynamic model selection, for instance. The resulting
trend indicator is meant to forecast future asset returns at a given horizon.

Intuitively, an investor should buy assets with positive return forecasts and sell assets
with negative forecasts. But the size of each long or short position is a quantitative problem
that requires a clear investment process. This process should take into account the risk
entailed by each position, compared with the expected return. Traditionally, individual
risks can be calculated in relation to asset volatility. A correlation matrix can aggregate
those individual risks into a global portfolio risk. But in the case of a multi-asset trend
following strategy, should we consider the correlation of assets or the correlation of each
individual strategy? These may be quite different, as the correlations between strategies
are usually smaller than the correlations between assets in absolute terms. Even when the
portfolio risks can be calculated, the distribution of those risks between assets or strategies
remains an open problem. Clearly, this distribution should take into account the individual
risks, their correlations and the expected return of each asset. But there are many competing
allocation procedures, such as Markowitz portfolio theory or risk budgeting methods.

In addition, the total amount of risk in the portfolio must be decided. The average target
volatility of the portfolio is closely related to the risk aversion of the final investor. But this
total amount of risk may not be constant over time, as some periods could bring higher
expected returns than others. For example, some funds do not change the average size of
their positions during periods of high market volatility. This increases their risks, but they
consider that their return opportunities, even when risk-adjusted, are greater during those
periods. On the contrary, some investors reduce their exposure to markets during volatility
peaks, in order to limit their potential drawdowns. In any case, any consistent investment
process should measure and control the global risk of the portfolio.

These are just a few questions relating to trend following strategies. Many more arise in
practical cases, such as execution policies and transaction cost management. Each of these
issues must be studied in depth, and re-examined on a regular basis. This is the essence of
quantitative management processes.


A Statistical complements
A.1 State space model and Kalman filtering
A state space model is defined by a transition equation and a measurement equation. In
the measurement equation, we postulate the relationship between an observable vector and
a state vector, while the transition equation describes the generating process of the state
variables. The state vector αt is generated by a first-order Markov process of the form:
αt = Tt αt−1 + ct + Rt ηt
where αt is the vector of the m state variables, Tt is a m × m matrix, ct is a m × 1 vector
and Rt is a m × p matrix. The measurement equation of the state-space representation is:
yt = Zt αt + dt + εt
where yt is a n-dimension time series, Zt is a n × m matrix, dt is a n × 1 vector. ηt and εt
are assumed to be white noise processes of dimensions p and n respectively. These two last
uncorrelated processes are Gaussian with zero mean and respective covariance matrices Qt
and Ht . α0 ∼ N (a0 , P0 ) describes the initial position of the state vector. We define at and
a t|t−1 as the optimal estimators of αt based on all the information available respectively at
time t and t − 1. Let Pt and P t|t−1 be the associated covariance matrices34 . The Kalman
filter consists of the following set of recursive equations (Harvey, 1990):

$$\left\{\begin{array}{l}
a_{t|t-1} = T_t\,a_{t-1} + c_t\\
P_{t|t-1} = T_t\,P_{t-1}\,T_t^\top + R_t\,Q_t\,R_t^\top\\
y_{t|t-1} = Z_t\,a_{t|t-1} + d_t\\
v_t = y_t - y_{t|t-1}\\
F_t = Z_t\,P_{t|t-1}\,Z_t^\top + H_t\\
a_t = a_{t|t-1} + P_{t|t-1}\,Z_t^\top F_t^{-1}\,v_t\\
P_t = \left(I_m - P_{t|t-1}\,Z_t^\top F_t^{-1}\,Z_t\right)P_{t|t-1}
\end{array}\right.$$
where vt is the innovation process with covariance matrix Ft and $y_{t|t-1} = \operatorname{E}_{t-1}\left[y_t\right]$. Harvey
(1989) shows that we can obtain $a_{t+1|t}$ directly from $a_{t|t-1}$:

$$a_{t+1|t} = \left(T_{t+1} - K_t Z_t\right)a_{t|t-1} + K_t\,y_t + \left(c_{t+1} - K_t d_t\right)$$

where $K_t = T_{t+1}\,P_{t|t-1}\,Z_t^\top F_t^{-1}$ is the gain matrix. We also have:

$$a_{t+1|t} = T_{t+1}\,a_{t|t-1} + c_{t+1} + K_t\left(y_t - Z_t\,a_{t|t-1} - d_t\right)$$

Finally, we obtain:

$$\left\{\begin{array}{l}
y_t = Z_t\,a_{t|t-1} + d_t + v_t\\
a_{t+1|t} = T_{t+1}\,a_{t|t-1} + c_{t+1} + K_t\,v_t
\end{array}\right.$$

This system is called the innovation representation.
Let t′ be a fixed given date. We define $a_{t|t'} = \operatorname{E}_{t'}\left[\alpha_t\right]$ and $P_{t|t'} = \operatorname{E}_{t'}\left[\left(a_{t|t'} - \alpha_t\right)\left(a_{t|t'} - \alpha_t\right)^\top\right]$
with t ≤ t′. We have $a_{t'|t'} = a_{t'}$ and $P_{t'|t'} = P_{t'}$. The Kalman smoother is then defined
by the following set of recursive equations:

$$\left\{\begin{array}{l}
P_t^\star = P_t\,T_{t+1}^\top P_{t+1|t}^{-1}\\
a_{t|t'} = a_t + P_t^\star\left(a_{t+1|t'} - a_{t+1|t}\right)\\
P_{t|t'} = P_t + P_t^\star\left(P_{t+1|t'} - P_{t+1|t}\right)P_t^{\star\top}
\end{array}\right.$$

34 We have $a_t = \operatorname{E}_t\left[\alpha_t\right]$, $a_{t|t-1} = \operatorname{E}_{t-1}\left[\alpha_t\right]$, $P_t = \operatorname{E}_t\left[\left(a_t - \alpha_t\right)\left(a_t - \alpha_t\right)^\top\right]$ and
$P_{t|t-1} = \operatorname{E}_{t-1}\left[\left(a_{t|t-1} - \alpha_t\right)\left(a_{t|t-1} - \alpha_t\right)^\top\right]$, where $\operatorname{E}_t$ indicates the conditional expectation operator.
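For completeness, a compact numpy sketch of the filter recursions of this appendix is given below. Time-invariant system matrices and illustrative variance values are assumed; this is not the implementation behind the figures of this paper.

```python
import numpy as np

def kalman_filter(y, T, Z, c, d, R, Q, H, a0, P0):
    # One pass of the Kalman filter recursions (time-invariant matrices)
    a, P = a0, P0
    filtered = []
    for yt in y:
        a_pred = T @ a + c                          # a_{t|t-1}
        P_pred = T @ P @ T.T + R @ Q @ R.T          # P_{t|t-1}
        v = yt - (Z @ a_pred + d)                   # innovation v_t
        F = Z @ P_pred @ Z.T + H                    # innovation covariance F_t
        K = P_pred @ Z.T @ np.linalg.inv(F)
        a = a_pred + K @ v                          # a_t
        P = P_pred - K @ Z @ P_pred                 # P_t
        filtered.append(a.copy())
    return np.array(filtered)

# Example: local linear trend model with state (level, slope)
T = np.array([[1.0, 1.0], [0.0, 1.0]]); Z = np.array([[1.0, 0.0]])
R = np.eye(2); Q = np.diag([1e-4, 1e-6]); H = np.array([[1e-2]])
c = np.zeros(2); d = np.zeros(1)
y = np.cumsum(np.random.default_rng(3).normal(0.01, 0.1, 300)).reshape(-1, 1)
states = kalman_filter(y, T, Z, c, d, R, Q, H, np.zeros(2), np.eye(2))
print(states[-1])   # last filtered level and slope
```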

A.2 L1 filtering
A.2.1 The dual problem
The L1 filtering problem can be solved by considering the dual problem which is a QP
programme. We first rewrite the primal problem with a new variable z = Dx̂:

$$\begin{array}{ll}
\min & \frac{1}{2}\left\|y - \hat{x}\right\|_2^2 + \lambda\left\|z\right\|_1\\
\text{u.c.} & z = D\hat{x}
\end{array}$$

We now construct the Lagrangian function with the dual variable $\nu \in \mathbb{R}^{n-2}$:

$$L\left(\hat{x},z,\nu\right) = \frac{1}{2}\left\|y - \hat{x}\right\|_2^2 + \lambda\left\|z\right\|_1 + \nu^\top\left(D\hat{x} - z\right)$$

The dual objective function is obtained in the following way:

$$\inf_{\hat{x},z} L\left(\hat{x},z,\nu\right) = -\frac{1}{2}\nu^\top D D^\top\nu + y^\top D^\top\nu$$

for $-\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}$. According to the Kuhn-Tucker theorem, the initial problem is equivalent
to the dual problem:

$$\begin{array}{ll}
\min & \frac{1}{2}\nu^\top D D^\top\nu - y^\top D^\top\nu\\
\text{u.c.} & -\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}
\end{array}$$

This QP programme can be solved by a traditional Newton algorithm or by interior-point
methods, and finally, the solution of the trend is:

$$\hat{x} = y - D^\top\nu$$
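A convenient way to solve this dual numerically is to note that it is a box-constrained least-squares problem in ν, since (1/2)ν⊤DD⊤ν − y⊤D⊤ν equals (1/2)‖D⊤ν − y‖² up to a constant. The sketch below exploits this reformulation with scipy; it is an illustration under that reformulation, not the interior-point scheme described in the next subsection, and the simulated series and λ value are arbitrary.

```python
import numpy as np
from scipy.optimize import lsq_linear

def l1_trend_filter(y, lam):
    # Dual QP: min 0.5*v'DD'v - y'D'v  s.t. -lam <= v <= lam, solved as the
    # bounded least-squares problem min ||D'v - y||^2, then x = y - D'v
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):                  # second-difference operator D2
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    nu = lsq_linear(D.T, y, bounds=(-lam, lam)).x
    return y - D.T @ nu

rng = np.random.default_rng(4)
y = np.concatenate([np.linspace(0, 5, 100), np.linspace(5, 2, 100)]) + rng.normal(0, 0.3, 200)
x_hat = l1_trend_filter(y, lam=20.0)        # piecewise-linear trend estimate
```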

A.2.2 Solving using interior-point algorithms


We briefly present the interior-point algorithm of Boyd and Vandenberghe (2009) in the case
of the following optimisation problem:

$$\begin{array}{ll}
\min & f_0\left(\theta\right)\\
\text{u.c.} & A\theta = b\\
& f_i\left(\theta\right) < 0 \quad\text{for } i = 1,\dots,m
\end{array}$$

where f0, . . . , fm : Rn → R are convex and twice continuously differentiable and rank (A) =
p < n. The inequality constraints will become implicit if the problem is rewritten as:

$$\begin{array}{ll}
\min & f_0\left(\theta\right) + \sum_{i=1}^{m} I_-\left(f_i\left(\theta\right)\right)\\
\text{u.c.} & A\theta = b
\end{array}$$

where I− (u) : R → R is the non-positive indicator function35. This indicator function is
discontinuous, so the Newton method cannot be applied. In order to overcome this prob-
lem, we approximate I− (u) using the logarithmic barrier function $\hat{I}_-\left(u\right) = -\tau^{-1}\ln\left(-u\right)$

35 We have:
$$I_-\left(u\right) = \left\{\begin{array}{ll}0 & u \leq 0\\ \infty & u > 0\end{array}\right.$$


with τ → ∞. Finally, the Kuhn-Tucker condition for this approximation problem gives
rτ (θ, λ, ν) = 0 with:

$$r_\tau\left(\theta,\lambda,\nu\right) = \left(\begin{array}{c}
\nabla f_0\left(\theta\right) + \nabla f\left(\theta\right)^\top\lambda + A^\top\nu\\
-\operatorname{diag}\left(\lambda\right)f\left(\theta\right) - \tau^{-1}\mathbf{1}\\
A\theta - b
\end{array}\right)$$

The solution of rτ (θ, λ, ν) = 0 can be obtained using Newton's iteration for the triple
π = (θ, λ, ν):

$$r_\tau\left(\pi + \Delta\pi\right) \simeq r_\tau\left(\pi\right) + \nabla r_\tau\left(\pi\right)\Delta\pi = 0$$

This equation gives the Newton step $\Delta\pi = -\nabla r_\tau\left(\pi\right)^{-1} r_\tau\left(\pi\right)$, which defines the search
direction.

A.2.3 The multivariate case


In the multivariate case, the primal problem is:

$$\begin{array}{ll}
\min & \frac{1}{2}\sum_{j=1}^{m}\left\|y^{(j)} - \hat{x}\right\|_2^2 + \lambda\left\|z\right\|_1\\
\text{u.c.} & z = D\hat{x}
\end{array}$$

The dual objective function becomes:

$$\inf_{\hat{x},z} L\left(\hat{x},z,\nu\right) = -\frac{1}{2}\nu^\top D D^\top\nu + \bar{y}^\top D^\top\nu + \frac{1}{2}\sum_{j=1}^{m}\left(y^{(j)} - \bar{y}\right)^\top\left(y^{(j)} - \bar{y}\right)$$

for $-\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}$. According to the Kuhn-Tucker theorem, the initial problem is equivalent
to the dual problem:

$$\begin{array}{ll}
\min & \frac{1}{2}\nu^\top D D^\top\nu - \bar{y}^\top D^\top\nu\\
\text{u.c.} & -\lambda\mathbf{1} \leq \nu \leq \lambda\mathbf{1}
\end{array}$$

The solution is then $\hat{x} = \bar{y} - D^\top\nu$.

A.2.4 The scaling of the smoothing parameter


We can attempt to estimate the order of magnitude of the parameter λmax by considering
the continuous case. We assume that the signal is a process Wt. The value of λmax in the
discrete case is defined by:

$$\lambda_{\max} = \left\|\left(D D^\top\right)^{-1} D\,y\right\|_\infty$$

This quantity can be considered as the first primitive $I_1\left(T\right) = \int_0^T W_t\,\mathrm{d}t$ of the process Wt if D = D1
(L1 − C filtering) or the second primitive $I_2\left(T\right) = \int_0^T\!\!\int_0^t W_s\,\mathrm{d}s\,\mathrm{d}t$ of Wt if D = D2 (L1 − T
filtering). We have:

$$\begin{array}{rcl}
I_1\left(T\right) & = & \displaystyle\int_0^T W_t\,\mathrm{d}t\\
& = & \displaystyle W_T\,T - \int_0^T t\,\mathrm{d}W_t\\
& = & \displaystyle\int_0^T\left(T - t\right)\mathrm{d}W_t
\end{array}$$

The process I1 (T) is a Wiener integral (or a Gaussian process) with variance:

$$\operatorname{E}\left[I_1^2\left(T\right)\right] = \int_0^T\left(T - t\right)^2\mathrm{d}t = \frac{T^3}{3}$$

In this case, we expect that $\lambda_{\max} \sim T^{3/2}$. The second order primitive can be calculated in
the following way:

$$\begin{array}{rcl}
I_2\left(T\right) & = & \displaystyle\int_0^T I_1\left(t\right)\mathrm{d}t\\
& = & \displaystyle I_1\left(T\right)T - \int_0^T t\,\mathrm{d}I_1\left(t\right)\\
& = & \displaystyle I_1\left(T\right)T - \int_0^T t\,W_t\,\mathrm{d}t\\
& = & \displaystyle I_1\left(T\right)T - \frac{T^2}{2}W_T + \int_0^T \frac{t^2}{2}\,\mathrm{d}W_t\\
& = & \displaystyle -\frac{T^2}{2}W_T + \int_0^T\left(T^2 - Tt + \frac{t^2}{2}\right)\mathrm{d}W_t\\
& = & \displaystyle\frac{1}{2}\int_0^T\left(T - t\right)^2\mathrm{d}W_t
\end{array}$$

This quantity is again a Gaussian process with variance:

$$\operatorname{E}\left[I_2^2\left(T\right)\right] = \frac{1}{4}\int_0^T\left(T - t\right)^4\mathrm{d}t = \frac{T^5}{20}$$

In this case, we expect that $\lambda_{\max} \sim T^{5/2}$.
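These scalings can be checked numerically. The sketch below computes λmax on simulated random walks of increasing length; the construction of D by differencing the identity matrix, the sample sizes and the number of replications are choices made for the example.

```python
import numpy as np

def lambda_max(y, order=2):
    # lambda_max = || (D D')^{-1} D y ||_inf, D = difference operator of given order
    T = len(y)
    D = np.diff(np.eye(T), n=order, axis=0)
    return np.abs(np.linalg.solve(D @ D.T, D @ y)).max()

rng = np.random.default_rng(5)
for T in (100, 200, 400, 800):
    lm = np.mean([lambda_max(np.cumsum(rng.normal(size=T))) for _ in range(10)])
    print(T, lm)    # should grow roughly like T**2.5 for the L1-T filter (order 2)
```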

A.3 Wavelet analysis


The time analysis can detect anomalies in time series, such as a market crash on a specific
date. The frequency analysis detects repeated sequences in a signal. The double dimension
analysis makes it possible to coordinate time and frequency detection, as we use a larger
time window than a smaller frequency interval (see Figure 23). In this area, the uncertainty
of localisation is 1/dt, with dt the sampling step and f = 1/dt the sampling frequency. The
wavelet transform can be a solution to analysing time series in terms of the time-frequency
dimension.

The first wavelet approach appeared in the early eighties in seismic data analysis. The
term wavelet was introduced in the scientific community by Grossmann and Morlet (1984).
Since 1986, a great deal of theoretical research, including wavelets, has been developed.
The wavelet transform uses a basic function, called the mother wavelet, then dilates and
translates it to capture features that are local in time and frequency. The distribution of the
time-frequency domain with respect to the wavelet transform is long in time when capturing
low frequency events and long in frequency when capturing high frequency events. As an
example, we represent some mother wavelets in Figure 24.

The aim of wavelet analysis is to separate signal trends and details. These different
components can be distinguished by different levels of resolution or different sizes/scales
of detail. In this sense, it generates a phase space decomposition which is defined by two


Figure 23: Time-frequency dimension

Figure 24: Some mother wavelets

parameters (scale and location) in opposition to a Fourier decomposition. A wavelet ψ (t)
is a function of time t such that:

$$\int_{-\infty}^{+\infty}\psi\left(t\right)\mathrm{d}t = 0 \qquad\text{and}\qquad \int_{-\infty}^{+\infty}\left|\psi\left(t\right)\right|^2\mathrm{d}t = 1$$

The continuous wavelet transform is a function of two variables W (u, s) and is given by
projecting the time series x (t) onto a particular wavelet ψ:

$$W\left(u,s\right) = \int_{-\infty}^{+\infty} x\left(t\right)\psi_{u,s}\left(t\right)\mathrm{d}t$$

with:

$$\psi_{u,s}\left(t\right) = \frac{1}{\sqrt{s}}\,\psi\left(\frac{t-u}{s}\right)$$

which corresponds to the mother wavelet translated by u (location parameter) and dilated
by s (scale parameter). If the wavelet satisfies the previous properties, the inverse operation
may be performed to produce the original signal from its wavelet coefficients:

$$x\left(t\right) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} W\left(u,s\right)\psi_{u,s}\left(t\right)\mathrm{d}u\,\mathrm{d}s$$

The continuous wavelet transform of a time series signal x (t) gives an infinite number
of coefficients W (u, s) where u ∈ R and s ∈ R+ , but many coefficients are close or equal to
zero. The discrete wavelet transform can be used to decompose a signal into a finite number
of coefficients, where we use $s = 2^{-j}$ as the scale parameter and $u = k\,2^{-j}$ as the location
parameter with j ∈ Z and k ∈ Z. Therefore ψu,s (t) becomes:

$$\psi_{j,k}\left(t\right) = 2^{j/2}\,\psi\left(2^j t - k\right)$$

where j = 1, 2, . . . , J in a J-level decomposition. The wavelet representation of a discrete
signal x (t) is given by:

$$x\left(t\right) = s^{(0)}\,\phi\left(t\right) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j - 1} d^{(j),k}\,\psi_{j,k}\left(t\right)$$

where φ (t) = 1 if t ∈ [0, 1] and J is the number of multi-resolution levels. Therefore,
computing the wavelet transform of the discrete signal is equivalent to computing the smooth
coefficient s(0) and the detail coefficients d(j),k.

Introduced by Mallat (1989), the multi-scale analysis corresponds to the following iter-
ative scheme:
    x   →  (s, d)
    s   →  (ss, sd)
    ss  →  (sss, ssd)
    sss →  (ssss, sssd)


where the high-pass filter defines the details of the data and the low-pass filter defines the
smoothing signal. In this example, we obtain these wavelet coefficients:
⎡ ⎤
ssss
⎢ sssd ⎥
⎢ ⎥
W =⎢ ⎢ ssd ⎥

⎣ sd ⎦
d

Applying this pyramidal algorithm to the time series signal up to the J resolution level gives
us the wavelet coefficients: ⎡ ⎤
s(0)
⎢ d(0) ⎥
⎢ ⎥
⎢ d(1) ⎥
⎢ ⎥
W =⎢ ⎢ . ⎥

⎢ . ⎥
⎢ ⎥
⎣ . ⎦
d(J−1)
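In practice, this decomposition is what packages such as PyWavelets implement. The sketch below extracts a trend by keeping only the smooth coefficients and zeroing the detail coefficients before reconstruction; the db4 mother wavelet, the decomposition level and the simulated series are arbitrary choices made for the example.

```python
import numpy as np
import pywt

def wavelet_trend(y, wavelet="db4", level=5):
    # keep the approximation coefficients, zero the detail coefficients d(j),k
    coeffs = pywt.wavedec(y, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(y)]

rng = np.random.default_rng(6)
y = np.cumsum(rng.normal(0.02, 1.0, 1024))
trend = wavelet_trend(y)
```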

A.4 Support vector machine


The support vector machine is an important part of statistical learning theory (Hastie et al.,
2009). It was first introduced by Boser et al. (1992) and has been used in various domains
such as pattern recognition, biometrics, etc. This technique can be employed in different
contexts such as classification, regression or density estimation (see Vapnik, 1998). Recently,
applications in finance have been developed in two main directions. The first employs the
SVM as a nonlinear estimator in order to forecast the trend or volatility of financial assets.
In this context, the SVM is used as a regression technique with the possibility of extension
to nonlinear cases thanks to the kernel approach. The second direction consists of using
the SVM as a classification technique which aims to define the stock selection in trading
strategies.

A.4.1 SVM in a nutshell


We illustrate here the basic idea of the SVM as a classification method. Let us define the
training data set consisting of n pairs of “input/output” points (xi , yi ) where xi ∈ X and
yi ∈ {−1, 1}. The idea of linear classification is to look for a possible hyperplane that
can separate {xi ⊂ X } into two classes corresponding to the labels yi = ±1. It consists of
constructing a linear discriminant function $h\left(x\right) = w^\top x + b$ where w is the vector of weights
and b is called the bias. The hyperplane is then defined by the following equation:

$$H = \left\{x : h\left(x\right) = w^\top x + b = 0\right\}$$

The vector w is interpreted as the normal vector to the hyperplane. We denote its norm
$\left\|w\right\|$ and its direction $\hat{w} = w/\left\|w\right\|$. In Figure 25, we give a geometric interpretation of the
margin in the linear case. Let x+ and x− be the closest points to the hyperplane from the
positive side and the negative side. These points determine the margin to the boundary from
which the two classes of points D are separated:

$$m_{D}\left(h\right) = \frac{1}{2}\,\hat{w}^\top\left(x_+ - x_-\right) = \frac{1}{\left\|w\right\|}$$

Figure 25: Geometric interpretation of the margin in a linear SVM

The main idea of a maximum margin classifier is to determine the hyperplane that maximises
the margin. For a separable dataset, the margin SVM is defined by the following optimisation
problem:

$$\begin{array}{ll}
\min\limits_{w,b} & \frac{1}{2}\left\|w\right\|^2\\
\text{u.c.} & y_i\left(w^\top x_i + b\right) \geq 1 \quad\text{for } i = 1,\dots,n
\end{array}$$

The historical approach to solving this quadratic problem with nonlinear constraints is to
map the primal problem to the dual problem:

$$\begin{array}{ll}
\max\limits_{\alpha} & \sum\limits_{i=1}^{n}\alpha_i - \frac{1}{2}\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,x_i^\top x_j\\
\text{u.c.} & \alpha_i \geq 0 \quad\text{for } i = 1,\dots,n
\end{array}$$

Because of the Kuhn-Tucker conditions, the optimised solution $\left(w^\star,b^\star\right)$ of the primal problem
is given by $w^\star = \sum_{i=1}^{n}\alpha_i^\star y_i x_i$ where $\alpha^\star = \left(\alpha_1^\star,\dots,\alpha_n^\star\right)$ is the solution of the dual problem.

We notice that linear SVM depends on input data via the inner product. An intelligent
way to extend SVM formalism to the nonlinear case is then to replace the inner product
with a nonlinear kernel. Hence, the nonlinear SVM dual problem can be obtained by sys-
tematically replacing the inner product $x_i^\top x_j$ by a general kernel K (xi, xj). Some standard
kernels are widely used in pattern recognition, for example polynomial, radial basis or neural


network kernels36. Finally, the decision/prediction function is then given by:

$$f\left(x\right) = \operatorname{sgn}\left(h\left(x\right)\right) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i^\star y_i\,K\left(x,x_i\right) + b^\star\right)$$
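As an illustration of the classification use of the SVM, the toy sketch below trains a Gaussian-kernel classifier with scikit-learn to predict the sign of the next daily return from the last p returns. The feature construction, the simulated return series and all parameter values are assumptions made for the example, not a strategy of this paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
r = rng.normal(0.0, 0.01, 1000) + 0.002 * np.sin(np.arange(1000) / 50.0)   # toy returns
p = 10
X = np.array([r[i - p:i] for i in range(p, len(r))])   # last p returns as features
y = np.sign(r[p:])                                     # label = sign of the next return

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:800], y[:800])
print("out-of-sample hit ratio:", (clf.predict(X[800:]) == y[800:]).mean())
```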

A.4.2 SVM regression


In the last discussion, we presented the basic idea of the SVM in the classification context.
We now show how the regression problem can be interpreted as a SVM problem. In the
general framework of statistical learning, the SVM problem consists of minimising the risk
function R (f ) depending on the form of the prediction function f (x). The risk function is
calculated via the loss function L (f (x) , y) which clearly defines our objective (classification
or regression):
$$R\left(f\right) = \int L\left(f\left(x\right),y\right)\mathrm{d}P\left(x,y\right)$$

where the distribution P (x, y) can be computed by an empirical distribution37 or an approx-
imated distribution38. For the regression problem, the loss function is simply defined as
$L\left(f\left(x\right),y\right) = \left(f\left(x\right)-y\right)^2$ or $L\left(f\left(x\right),y\right) = \left|f\left(x\right)-y\right|^p$ in the case of the Lp norm.

We have seen that the linear SVM is a special case of nonlinear SVM within the kernel
approach. We therefore consider the nonlinear case directly where the approximate function
of the regression has the following form $f\left(x\right) = w^\top\phi\left(x\right) + b$. In the VRM framework, we
assume that P (x, y) is a Gaussian noise with variance σ2:

$$R\left(f\right) = \frac{1}{n}\sum_{i=1}^{n}\left|f\left(x_i\right)-y_i\right|^p + \sigma^2\left\|w\right\|^2$$

We introduce the variable ξ = (ξ1, . . . , ξn) which satisfies yi = f (xi) + ξi. The optimisa-
tion problem of the risk function can now be written as a QP programme with nonlinear
constraints:

$$\begin{array}{ll}
\min\limits_{w,b,\xi} & \frac{1}{2}\left\|w\right\|^2 + \left(2n\sigma^2\right)^{-1}\sum\limits_{i=1}^{n}\left|\xi_i\right|^p\\
\text{u.c.} & y_i = w^\top\phi\left(x_i\right) + b + \xi_i \quad\text{for } i = 1,\dots,n
\end{array}$$

In the present form, the regression looks very similar to the SVM classification problem and
can be solved in the same way by mapping to the dual problem. We notice that the SVM
regression can be easily generalised in two possible ways:
1. by introducing a more general loss function such as the ε-SV regression proposed by
Vapnik (1998);
2. by using a weighting distribution ω for the empirical distribution:

$$\mathrm{d}P\left(x,y\right) = \sum_{i=1}^{n}\omega_i\,\delta_{x_i}\left(x\right)\delta_{y_i}\left(y\right)$$
36 We have, respectively, $K\left(x_i,x_j\right) = \left(x_i^\top x_j + 1\right)^p$, $K\left(x_i,x_j\right) = \exp\left(-\left\|x_i - x_j\right\|^2/\left(2\sigma^2\right)\right)$ or
$K\left(x_i,x_j\right) = \tanh\left(a\,x_i^\top x_j - b\right)$.
37 This framework, called ERM, was first introduced by Vapnik and Chervonenskis (1991).
38 This framework is called VRM (Chapelle, 2002).

As financial series have short memory and depend more on the recent past, an asym-
metric weight distribution focusing on recent data would improve the prediction39 .

The dual problem in the case p = 1 is given by:

$$\begin{array}{ll}
\max\limits_{\alpha} & \alpha^\top y - \frac{1}{2}\alpha^\top K\alpha\\
\text{u.c.} & \alpha^\top\mathbf{1} = 0\\
& \left|\alpha\right| \leq \left(2n\sigma^2\right)^{-1}\mathbf{1}
\end{array}$$

As previously, the optimal vector $\alpha^\star$ is obtained by solving the QP programme. We then
deduce that $w^\star = \sum_{i=1}^{n}\alpha_i^\star\,\phi\left(x_i\right)$ and $b^\star$ is computed using the Kuhn-Tucker condition:

$$w^{\star\top}\phi\left(x_i\right) + b^\star - y_i = 0$$

for support vectors (xi , yi ). In order to achieve a good level of accuracy for the estimation
of b, we average out the set of support vectors and obtain $b^\star$. The SVM regressor is then
given by the following formula:

$$f\left(x\right) = \sum_{i=1}^{n}\alpha_i^\star\,K\left(x,x_i\right) + b^\star$$

with $K\left(x,x_i\right) = \phi\left(x\right)^\top\phi\left(x_i\right)$.

In Figure 26, we apply SVM regression with the Gaussian kernel to the S&P 500 index.
The kernel parameter σ characterises the estimation horizon which is equivalent to period
n in the moving average regression.

Figure 26: SVM filtering
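A comparable filter can be obtained with scikit-learn's SVR by regressing the series on the time index with a Gaussian kernel. This is only a sketch: the gamma, C and epsilon values are illustrative, and gamma plays the role of the kernel parameter σ discussed above (gamma ≈ 1/(2σ²)), so a smaller gamma produces a smoother trend.

```python
import numpy as np
from sklearn.svm import SVR

def svm_filter(y, gamma=1e-4, C=100.0, epsilon=0.1):
    # Gaussian-kernel SVM regression of the series on the time index
    t = np.arange(len(y), dtype=float).reshape(-1, 1)
    return SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon).fit(t, y).predict(t)

rng = np.random.default_rng(9)
y = np.cumsum(rng.normal(0.05, 1.0, 500))
trend = svm_filter(y)
```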

A.5 Singular spectrum analysis


In recent years the singular spectrum analysis (SSA) technique has been developed as a
time-frequency domain method40 . It consists of decomposing a time series into a trend,
oscillatory components and a noise.

The method is based on the principal component analysis of the auto-covariance matrix
of the time series y = (y1 , . . . , yt ). Let n be the window length such that n = t − m + 1 with
m < t/2. We define the n × m Hankel matrix H as the matrix of the m concatenated lag
vectors of y:

$$H = \left(\begin{array}{ccccc}
y_1 & y_2 & y_3 & \cdots & y_m\\
y_2 & y_3 & y_4 & \cdots & y_{m+1}\\
y_3 & y_4 & y_5 & \cdots & \vdots\\
\vdots & \vdots & \vdots & \ddots & y_{t-1}\\
y_n & y_{n+1} & y_{n+2} & \cdots & y_t
\end{array}\right)$$
We recover the time series y by diagonal averaging:

$$y_p = \frac{1}{\alpha_p}\sum_{j=1}^{m} H^{(i,j)} \qquad (10)$$

where i = p − j + 1, 0 < i < n + 1 and:

$$\alpha_p = \left\{\begin{array}{ll}
p & \text{if } p < m\\
t - p + 1 & \text{if } p > t - m + 1\\
m & \text{otherwise}
\end{array}\right.$$

39 See Gestel et al. (2001) and Tay and Cao (2002).
40 Introduced by Broomhead and King (1986).

This relationship seems trivial because each H(i,j) is equal to yp with respect to the condi-
tions for i and j. But this equality no longer holds if we apply factor analysis. Let $C = H^\top H$
be the covariance matrix of H. By performing the eigenvalue decomposition $C = V\Lambda V^\top$, we
can deduce the corresponding principal components:

$$P_k = H V_k$$

where Vk is the matrix of the first k eigenvectors of C.

Let us now define the n × m matrix Ĥ as follows:

$$\hat{H} = P_k V_k^\top$$

We have Ĥ = H if all the components are selected. If k < m, we have removed the noise and
the trend x̂ is estimated by applying the diagonal averaging procedure (10) to the matrix
Ĥ.

We have applied the singular spectrum decomposition to the S&P 500 index with different
lags m. For each lag, we compute the Hankel matrix H, then deduce the matrix Ĥ using
only the first eigenvector (k = 1) and estimate the corresponding trend. Results are given
in Figure 27. As for other methods, such as nonlinear filters, the calibration depends on the
parameter m, which controls the window length.
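The whole procedure fits in a short numpy function. The sketch below keeps the first k eigenvectors (k = 1 as in Figure 27) and recovers the trend by diagonal averaging; it is a direct, unoptimised transcription of the steps above, and the simulated series is only there to make it runnable.

```python
import numpy as np

def ssa_trend(y, m, k=1):
    # Hankel matrix, principal components and diagonal averaging
    y = np.asarray(y, dtype=float)
    t = len(y)
    n = t - m + 1
    H = np.column_stack([y[j:j + n] for j in range(m)])    # n x m Hankel matrix
    C = H.T @ H
    _, V = np.linalg.eigh(C)                               # eigenvalues in ascending order
    Vk = V[:, -k:]                                         # k leading eigenvectors
    Hk = H @ Vk @ Vk.T                                     # rank-k approximation of H
    x, counts = np.zeros(t), np.zeros(t)
    for i in range(n):                                     # average the anti-diagonals
        for j in range(m):
            x[i + j] += Hk[i, j]
            counts[i + j] += 1
    return x / counts

rng = np.random.default_rng(10)
y = np.cumsum(rng.normal(0.05, 1.0, 500))
trend = ssa_trend(y, m=60)
```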

Figure 27: SSA filtering


References
[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008),
A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census
Bureau, RRS #2008/03.
[2] Antoniadis A., Gregoire G. and McKeague I.W. (1994), Wavelet Methods for
Curve Estimation, Journal of the American Statistical Association, 89(428), pp. 1340-
1353.
[3] Barberis N. and Thaler R. (2002), A Survey of Behavioral Finance, NBER Working
Paper, 9222.
[4] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of
Economic Time Series into Permanent and Transitory Components with Particular
Attention to Measurement of the Business Cycle, Journal of Monetary Economics,
7(2), pp. 151-174.
[5] Boser B.E., Guyon I.M. and Vapnik V. (1992), A Training Algorithm for Optimal
Margin Classifier, Proceedings of the Fifth Annual Workshop on Computational Learn-
ing Theory, pp. 114-152.
[6] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University
Press.
[7] Brockwell P.J. and Davis R.A. (2003), Introduction to Time Series and Forecasting,
Springer.
[8] Broomhead D.S. and King G.P. (1986), On the Qualitative Analysis of Experimental
Dynamical Systems, in Sarkar S. (ed.), Nonlinear Phenomena and Chaos, Adam Hilger,
pp. 113-144.
[9] Brown S.J., Goetzmann W.N. and Kumar A. (1998), The Dow Theory: William
Peter Hamilton’s Track Record Reconsidered, Journal of Finance, 53(4), pp. 1311-1333.
[10] Burch N., Fishback P.E. and Gordon R. (2005), The Least-Squares Property of the
Lanczos Derivative, Mathematics Magazine, 78(5), pp. 368-378.
[11] Carhart M.M. (1997), On Persistence in Mutual Fund Performance, Journal of Fi-
nance, 52(1), pp. 57-82.
[12] Chan L.K.C., Jegadeesh N. and Lakonishok J. (1996), Momentum Strategies, Jour-
nal of Finance, 51(5), pp. 1681-1713.
[13] Chang Y., Miller J.I. and Park J.Y. (2009), Extracting a Common Stochastic Trend:
Theory with Some Applications, Journal of Econometrics, 150(2), pp. 231-247.
[14] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning
and Prior Knowledge, PhD thesis, University of Paris 6.
[15] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A
Model for the Census X-11 Program, Journal of the American Statistical Association,
71(355), pp. 581-587.
[16] Cleveland W.S. (1979), Robust Locally Weighted Regression and Smoothing Scatterplots, Jour-
nal of the American Statistical Association, 74(368), pp. 829-836.

[17] Cleveland W.S. and Devlin S.J. (1988), Locally Weighted Regression: An Approach
to Regression Analysis by Local Fitting, Journal of the American Statistical Associa-
tion, 83(403), pp. 596-610.

[18] Cochrane J. (2001), Asset Pricing, Princeton University Press.

[19] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20(3),
pp. 273-297.

[20] D’Aspremont A. (2011), Identifying Small Mean Reverting Portfolios, Quantitative


Finance, 11(3), pp. 351-364.

[21] Daubechies I. (1992), Ten Lectures on Wavelets, SIAM.

[22] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Al-
gorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on
Pure and Applied Mathematics, 57(11), pp. 1413-1457.

[23] Donoho D.L. (1995), De-Noising by Soft-Thresholding, IEEE Transactions on Infor-


mation Theory, 41(3), pp. 613-627.

[24] Donoho D.L. and Johnstone I.M. (1994), Ideal Spatial Adaptation via Wavelet
Shrinkage, Biometrika, 81(3), pp. 425-455.

[25] Donoho D.L. and Johnstone I.M. (1995), Adapting to Unknown Smoothness via
Wavelet Shrinkage, Journal of the American Statistical Association, 90(432), pp. 1200-
1224.

[26] Doucet A., De Freitas N. and Gordon N. (2001), Sequential Monte Carlo in Prac-
tice, Springer.

[27] Ehlers J.F. (2001), Rocket Science for Traders: Digital Signal Processing Applications,
John Wiley & Sons.

[28] Elton E.J. and Gruber M.J. (1972), Earnings Estimates and the Accuracy of Expec-
tational Data, Management Science, 18(8), pp. 409-424.

[29] Engle R.F. and Granger C.W.J. (1987), Co-Integration and Error Correction: Rep-
resentation, Estimation, and Testing, Econometrica, 55(2), pp. 251-276.

[30] Fama E. (1970), Efficient Capital Markets: A Review of Theory and Empirical Work,
Journal of Finance, 25(2), pp. 383-417.

[31] Flandrin P., Rilling G. and Goncalves P. (2004), Empirical Mode Decomposition
as a Filter Bank, Signal Processing Letters, 11(2), pp. 112-114.

[32] Fliess M. and Join C. (2009), A Mathematical Proof of the Existence of Trends in
Financial Time Series, in El Jai A., Afifi L. and Zerrik E. (eds), Systems Theory:
Modeling, Analysis and Control, Presses Universitaires de Perpignan, pp. 43-62.

[33] Fuentes M. (2002), Spectral Methods for Nonstationary Spatial Processes, Biometrika,
89(1), pp. 197-210.

[34] Gençay R., Selçuk F. and Whitcher B. (2002), An Introduction to Wavelets and
Other Filtering Methods in Finance and Economics, Academic Press.


[35] Gestel T.V., Suykens J.A.K., Baestaens D., Lambrechts A., Lanckriet G.,
Vandaele B., De Moor B. and Vandewalle J. (2001), Financial Time Series Pre-
diction Using Least Squares Support Vector Machines Within the Evidence Framework,
IEEE Transactions on Neural Networks, 12(4), pp. 809-821.

[36] Golyandina N., Nekrutkin V.V. and Zhigljavsky A.A. (2001), Analysis of Time
Series Structure: SSA and Related Techniques, Chapman & Hall, CRC.

[37] Gonzalo J. and Granger C.W.J. (1995), Estimation of Common Long-Memory Com-
ponents in Cointegrated Systems, Journal of Business & Economic Statistics, 13(1), pp.
27-35.

[38] Grinblatt M., Titman S. and Wermers R. (1995), Momentum Investment Strate-
gies, Portfolio Performance, and Herding: A Study of Mutual Fund Behavior, American
Economic Review, 85(5), pp. 1088-1105.

[39] Groetsch C.W. (1998), Lanczo’s Generalized Derivative, American Mathematical


Monthly, 105(4), pp. 320-326.

[40] Grossmann A. and Morlet J. (1984), Decomposition of Hardy Functions into Square
Integrable Wavelets of Constant Shape, SIAM Journal of Mathematical Analysis, 15,
pp. 723-736.

[41] Härdle W. (1992), Applied Nonparametric Regression, Cambridge University Press.

[42] Harvey A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Fil-
ter, Cambridge University Press.

[43] Harvey A.C. and Trimbur T.M. (2003), General Model-Based Filters for Extracting
Cycles and Trends in Economic Time Series, Review of Economics and Statistics, 85(2),
pp. 244-255.

[44] Hastie T., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learn-
ing, second edition, Springer.

[45] Henderson R. (1916), Note on Graduation by Adjusted Average, Transactions of the


Actuarial Society of America, 17, pp. 43-48.

[46] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical
Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.

[47] Holt C.C. (1959), Forecasting Seasonals and Trends by Exponentially Weighted Mov-
ing Averages, ONR Research Memorandum, 52, reprinted in International Journal of
Forecasting, 2004, 20(1), pp. 5-10.

[48] Hong H. and Stein J.C. (1997), A Unified Theory of Underreaction, Momentum Trad-
ing and Overreaction in Asset Markets, NBER Working Paper, 6324.

[49] Johansen S. (1988), Statistical Analysis of Cointegration Vectors, Journal of Economic


Dynamics and Control, 12(2-3), pp. 231-254.

[50] Johansen S. (1991), Estimation and Hypothesis Testing of Cointegration Vectors in


Gaussian Vector Autoregressive Models, Econometrica, 59(6), pp. 1551-1580.

[51] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible
Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.

[52] Kalman R.E. (1960), A New Approach to Linear Filtering and Prediction Problems,
Transactions of the ASME – Journal of Basic Engineering, 82(D), pp. 35-45.

[53] Kendall M.G. (1973), Time Series, Charles Griffin.

[54] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM
Review, 51(2), pp. 339-360.

[55] Kolmogorov A.N. (1941), Interpolation and Extrapolation of Random Sequences,


Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 5(1), pp. 3-14.

[56] Macaulay F. (1931), The Smoothing of Time Series, National Bureau of Economic
Research.

[57] Mallat S.G. (1989), A Theory for Multiresolution Signal Decomposition: The Wavelet
Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(7), pp. 674-693.

[58] Mann H.B. (1945), Nonparametric Tests against Trend, Econometrica, 13(3), pp. 245-
259.

[59] Martin W. and Flandrin P. (1985), Wigner-Ville Spectral Analysis of Nonstationary


Processes, IEEE Transactions on Acoustics, Speech and Signal Processing, 33(6), pp.
1461-1470.

[60] Muth J.F. (1960), Optimal Properties of Exponentially Weighted Forecasts, Journal
of the American Statistical Association, 55(290), pp. 299-306.

[61] Oppenheim A.V. and Schafer R.W. (2009), Discrete-Time Signal Processing, third
edition, Prentice-Hall.

[62] Peña D. and Box, G.E.P. (1987), Identifying a Simplifying Structure in Time Series,
Journal of the American Statistical Association, 82(399), pp. 836-843.

[63] Pollock D.S.G. (2006), Wiener-Kolmogorov Filtering, Frequency-Selective Filtering


and Polynomial Regression, Econometric Theory, 23, pp. 71-83.

[64] Pollock D.S.G. (2009), Statistical Signal Extraction: A Partial Survey, in Kon-
toghiorges E. and Belsley D.E. (eds.), Handbook of Empirical Econometrics, John Wiley
and Sons.

[65] Rao S.T. and Zurbenko I.G. (1994), Detecting and Tracking Changes in Ozone air
Quality, Journal of Air and Waste Management Association, 44(9), pp. 1089-1092.

[66] Roncalli T. (2010), La Gestion d’Actifs Quantitative, Economica.

[67] Savitzky A. and Golay M.J.E. (1964), Smoothing and Differentiation of Data by
Simplified Least Squares Procedures, Analytical Chemistry, 36(8), pp. 1627-1639.

[68] Silverman B.W. (1985), Some Aspects of the Spline Smoothing Approach to Non-
Parametric Regression Curve Fitting, Journal of the Royal Statistical Society, B47(1),
pp. 1-52.

[69] Sorenson H.W. (1970), Least-Squares Estimation: From Gauss to Kalman, IEEE
Spectrum, 7, pp. 63-68.


[70] Stock J.H. and Watson M.W. (1988), Variable Trends in Economic Time Series,
Journal of Economic Perspectives, 2(3), pp. 147-174.

[71] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time
Series Forecasting, Neurocomputing, 48(1-4), pp. 847-861.
[72] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of
the Royal Statistical Society, B58(1), pp. 267-288.
[73] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.

[74] Vapnik V. and Chervonenskis A. (1991), On the Uniform Convergence of Relative


Frequency of Events to their Probabilities, Theory of Probability and its Applications,
16(2), pp. 264-280.
[75] Vautard R., Yiou P., and Ghil M. (1992), Singular Spectrum Analysis: A Toolkit
for Short, Noisy Chaotic Signals, Physica D, 58(1-4), pp. 95-126.
[76] Wahba G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Con-
ference Series in Applied Mathematics, 59, SIAM.
[77] Wang Y. (1998), Change Curve Estimation via Wavelets, Journal of the American
Statistical Association, 93(441), pp. 163-172.
[78] Wiener N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time
Series with Engineering Applications, MIT Technology Press and John Wiley & Sons
(originally published in 1941 as a Report on the Services Research Project, DIC-6037).
[79] Whittaker E.T. (1923), On a New Method of Graduation, Proceedings of the Edin-
burgh Mathematical Society, 41, pp. 63-75.
[80] Winters P.R. (1960), Forecasting Sales by Exponentially Weighted Moving Averages,
Management Science, 6(3), 324-342.
[81] Yue S. and Pilon P. (2004), A Comparison of the Power of the t-test, Mann-Kendall
and Bootstrap Tests for Trend Detection, Hydrological Sciences Journal, 49(1), 21-37.
[82] Zurbenko I., Porter P.S., Rao S.T., Ku J.K., Gui R. and Eskridge R.E. (1996),
Detecting Discontinuities in Time Series of Upper-Air Data: Demonstration of an Adap-
tive Filter Technique, Journal of Climate, 9(12), pp. 3548-3560.


Lyxor White Paper Series


List of Issues

• Issue #1 – Risk-Based Indexation.


Paul Demey, Sébastien Maillard and Thierry Roncalli, March 2010.

• Issue #2 – Beyond Liability-Driven Investment: New Perspectives on


Defined Benefit Pension Fund Management.
Benjamin Bruder, Guillaume Jamet and Guillaume Lasserre, March 2010.

• Issue #3 – Mutual Fund Ratings and Performance Persistence.


Pierre Hereil, Philippe Mitaine, Nicolas Moussavi and Thierry Roncalli, June 2010.

• Issue #4 – Time Varying Risk Premiums & Business Cycles: A Survey.


Serge Darolles, Karl Eychenne and Stéphane Martinetti, September 2010.

• Issue #5 – Portfolio Allocation of Hedge Funds.


Benjamin Bruder, Serge Darolles, Abdul Koudiraty and Thierry Roncalli, January
2011.

• Issue #6 – Strategic Asset Allocation.


Karl Eychenne, Stéphane Martinetti and Thierry Roncalli, March 2011.

• Issue #7 – Risk-Return Analysis of Dynamic Investment Strategies.


Benjamin Bruder and Nicolas Gaussel, June 2011.


Disclaimer

This material and its content are confidential and may not be reproduced or provided
to others without the express written permission of Lyxor Asset Management (“Lyxor AM”).
This material has been prepared solely for informational purposes only and it is not intended
to be and should not be considered as an offer, or a solicitation of an offer, or an invitation
or a personal recommendation to buy or sell participating shares in any Lyxor Fund, or
any security or financial instrument, or to participate in any investment strategy, directly
or indirectly.

It is intended for use only by those recipients to whom it is made directly available by Lyxor
AM. Lyxor AM will not treat recipients of this material as its clients by virtue of their
receiving this material.

This material reflects the views and opinions of the individual authors at this date and in
no way the official position or advices of any kind of these authors or of Lyxor AM and thus
does not engage the responsibility of Lyxor AM nor of any of its officers or employees. All
performance information set forth herein is based on historical data and, in some cases, hy-
pothetical data, and may reflect certain assumptions with respect to fees, expenses, taxes,
capital charges, allocations and other factors that affect the computation of the returns.
Past performance is not necessarily a guide to future performance. While the information
(including any historical or hypothetical returns) in this material has been obtained from
external sources deemed reliable, neither Société Générale (“SG”), Lyxor AM, nor their af-
filiates, officers or employees guarantee its accuracy, timeliness or completeness. Any opinions
expressed herein are statements of our judgment on this date and are subject to change with-
out notice. SG, Lyxor AM and their affiliates assume no fiduciary responsibility or liability
for any consequences, financial or otherwise, arising from, an investment in any security or
financial instrument described herein or in any other security, or from the implementation
of any investment strategy.

Lyxor AM and its affiliates may from time to time deal in, profit from the trading of, hold,
have positions in, or act as market makers, advisers, brokers or otherwise in relation to the
securities and financial instruments described herein.

Service marks appearing herein are the exclusive property of SG and its affiliates, as the
case may be.

This material is communicated by Lyxor Asset Management, which is authorized and reg-
ulated in France by the “Autorité des Marchés Financiers” (French Financial Markets Au-
thority).

© 2011 LYXOR ASSET MANAGEMENT. ALL RIGHTS RESERVED

The Lyxor White Paper Series is a quarterly publication providing our
clients access to intellectual capital, risk analytics and quantitative
research developed within Lyxor Asset Management. The Series
covers in depth studies of investment strategies, asset allocation
methodologies and risk management techniques. We hope you will
find the Lyxor White Paper Series stimulating and interesting.

PUBLISHING DIRECTORS
Alain Dubois, Chairman of the Board
Laurent Seyer, Chief Executive Officer

EDITORIAL BOARD
Nicolas Gaussel, PhD, Managing Editor.
Thierry Roncalli, PhD, Associate Editor
Benjamin Bruder, PhD, Associate Editor

Réf. 712100 – Studio Société Générale +33 (0)1 42 14 27 05 – 12/2011

Lyxor Asset Management


Tour Société Générale – 17 cours Valmy
92987 Paris – La Défense Cedex – France
research@lyxor.com – www.lyxor.com
