
Statistical Arbitrage for Mid-frequency Trading

Nicolas Kseib, Xiaolin Lin, Lorenzo Limonta, Mike Phulsuksombati


June 11, 2014
Abstract
The main goal of this project is to generate and exploit trading signals from
real-life high-frequency/mid-frequency trading data. With the aid of Thesys, we are
able to use real-life trading data to explore and evaluate Statistical Arbitrage based
algorithms. We use principal component analysis (PCA) to isolate residual signals. Using multiple data mining techniques, we develop market-neutral trading strategies. The
parameters for the different learning methods are updated using walk-forward optimization. Finally, we simulate the trading strategies on real data and evaluate their
performance. The results show that our methods, whether based on fitted models or
on the raw residual signal, can generate profitable strategies on the pre-processed data.

1 Introduction

In the field of investment, statistical arbitrage refers to strategies that attempt to profit from
pricing inefficiencies in the market, identified through mathematical models. The basic
assumption of any such strategy is that prices of similar securities will move towards a
historical average. It encompasses a variety of strategies and investment programs whose
common features are:
- Trading signals are systematic
- The trading book is market-neutral
- The mechanism for generating excess returns is statistical
The idea is to make many bets with positive expected returns, taking advantage of diversification across stocks, to produce a low-volatility investment strategy which is uncorrelated
with the market.
Historically, the father of modern statistical arbitrage techniques is pairs trading, a strategy in which two securities with similar return behavior are first identified and then traded.
Once their respective values diverge significantly from the expected mean, one goes long
the underperforming security whilst going short the security performing better than expected. This is done under the assumption that in the long term their prices will converge
back to their mean.

In this paper we follow the natural extension of such a strategy: rather than simply choosing
a pair, we trade groups of stocks against other groups of stocks, thus implementing a generalized
pairs-trading technique. What is innovative in our technique is the trading time horizon, which is extremely
different from that of usual arbitrage strategies. Rather than time windows of weeks or months, we
behave as high-frequency trading (HFT) firms do, looking for imbalances in the short term, on
the order of a few minutes at most.
In this section, we introduce one way to construct the residual signals. This signal will be
the basis of our trading strategy, as it will allow us to distinguish which information is relevant
and which is noise. In section 2, we present walk-forward optimization (WFO), the
method we use to update our parameters, as well as our trading strategy and some signal
filtering methods used to improve our performance. The simulation results and daily returns
from our strategy are shown in section 3. We conclude and discuss our challenges in section 4.

1.1 PCA Analysis

As one can imagine, fundamental to the correct implementation of such a strategy is understanding the correlation between the price movements of the different assets that make up our
book. We follow the approach of [AL10] and [Lal+99]. A first approach for extracting our
signal of interest from the data is to use Principal Component Analysis (PCA). This approach
uses historical share-price data on a cross-section of N stocks going back M days in history.
For simplicity of exposition, the cross-section is assumed to be identical to the investment
universe, although this need not be the case in practice. Let us represent the stocks' return
data, on any given date t_0, going back M + 1 days, as a matrix
    R_{ik} = \frac{S_i(t_0 - (k-1)\Delta t) - S_i(t_0 - k\Delta t)}{S_i(t_0 - k\Delta t)}, \quad k = 1, \ldots, M, \; i = 1, \ldots, N,    (1)

where S_{it} is the price of stock i at time t adjusted for dividends and \Delta t = 1 minute. Since
some stocks are more volatile than others, it is convenient to work with the standardized returns
matrix Y,

    Y_{ik} = \frac{R_{ik} - \bar{R}_i}{\sigma_i}    (2)

where

    \bar{R}_i = \frac{1}{M} \sum_{k=1}^{M} R_{ik}    (3)

and

    \sigma_i^2 = \frac{1}{M-1} \sum_{k=1}^{M} (R_{ik} - \bar{R}_i)^2.    (4)

The empirical correlation matrix C of the data is defined by

    C_{ij} = \frac{1}{M-1} \sum_{k=1}^{M} Y_{ik} Y_{jk},    (5)

which is symmetric and positive definite. Notice that, for any index i, we have

    C_{ii} = \frac{1}{M-1} \sum_{k=1}^{M} (Y_{ik})^2 = \frac{\sum_{k=1}^{M} (R_{ik} - \bar{R}_i)^2}{(M-1)\,\sigma_i^2} = 1.    (6)
The commonly used solution to extract meaningful information from the data is Principal Components Analysis. We consider the eigenvectors and eigenvalues of the empirical
correlation matrix and rank the eigenvalues in decreasing order:

    \lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_N \geq 0.    (7)
We denote the corresponding eigenvectors by

    v^{(j)} = (v_1^{(j)}, \ldots, v_N^{(j)}).    (8)

We denote by \rho(\lambda) the density of eigenvalues of the empirical correlation matrix,

    \rho(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda},    (9)

where n(\lambda) is the number of eigenvalues of C less than \lambda. Interestingly, if Y is a
T \times N random matrix, \rho_C(\lambda) is exactly known in the limit N \to \infty,
T \to \infty with Q = T/N \geq 1 fixed, and reads

    \rho_C(\lambda) = \frac{Q}{2\pi\sigma^2} \, \frac{\sqrt{(\lambda_{max} - \lambda)(\lambda - \lambda_{min})}}{\lambda},    (10)

    \lambda_{max/min} = \sigma^2 \left( 1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}} \right),    (11)

with \lambda \in [\lambda_{min}, \lambda_{max}] and where \sigma^2 is equal to the variance of the elements of Y.
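Eigenvalues above the upper Marchenko-Pastur edge \lambda_{max} are taken as carrying information beyond noise. A minimal sketch of this filter (Python with NumPy assumed; the function names are ours, not from the paper):

```python
import numpy as np

def mp_bounds(Q, sigma2=1.0):
    """Marchenko-Pastur support edges of equation (11) for Q = T/N >= 1."""
    lam_max = sigma2 * (1 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q))
    lam_min = sigma2 * (1 + 1.0 / Q - 2.0 * np.sqrt(1.0 / Q))
    return lam_min, lam_max

def significant_eigs(C, Q, sigma2=1.0):
    """Keep eigenpairs of the correlation matrix C above the MP upper edge."""
    lam, vecs = np.linalg.eigh(C)          # eigh returns ascending order
    lam, vecs = lam[::-1], vecs[:, ::-1]   # descending, as in equation (7)
    _, lam_max = mp_bounds(Q, sigma2)
    keep = lam > lam_max
    return lam[keep], vecs[:, keep]
```

For an identity-like correlation matrix all eigenvalues fall inside the noise band, while a strongly correlated cross-section produces a market mode well above \lambda_{max}.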
Let \lambda_1, \ldots, \lambda_m be the significant eigenvalues in the above sense. For each index j,
we consider the corresponding eigenportfolio, in which the respective amount
invested in each of the stocks is defined as

    Q_i^{(j)} = \frac{v_i^{(j)}}{\sigma_i}.    (12)
The eigenportfolio returns are therefore

    F_{jk} = \sum_{i=1}^{N} \frac{v_i^{(j)}}{\sigma_i} R_{ik}, \quad j = 1, \ldots, m.    (13)

The residual signal can then be generated as follows:

    \hat{\beta} = (F^T F)^{-1} F^T C    (14)
    \hat{C} = F \hat{\beta}    (15)
    \text{Residual} = C - \hat{C}    (16)
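Equations (2)-(5) and (12)-(16) can be put together in a short pipeline. This is a simplified reading (NumPy assumed, function name ours): the symbol C is overloaded in the paper, and following [AL10] we read equations (14)-(16) as regressing the stock returns on the eigenportfolio factor returns.

```python
import numpy as np

def residual_signal(R, n_factors):
    """Sketch of residual extraction. R is an N x M matrix of stock
    returns (stocks x minutes); n_factors is the number of significant
    eigenportfolios retained."""
    mu = R.mean(axis=1, keepdims=True)
    sigma = R.std(axis=1, ddof=1, keepdims=True)
    Y = (R - mu) / sigma                         # equation (2)
    C = (Y @ Y.T) / (R.shape[1] - 1)             # equation (5)
    lam, V = np.linalg.eigh(C)
    V = V[:, ::-1][:, :n_factors]                # top eigenvectors, eq (7)-(8)
    F = (V / sigma).T @ R                        # eigenportfolio returns, eq (13)
    # regress returns on factor returns, eq (14)-(15)
    beta = np.linalg.lstsq(F.T, R.T, rcond=None)[0]
    return R - (F.T @ beta).T                    # residual, eq (16)
```

If the cross-section is exactly driven by a single common factor, one eigenportfolio explains all returns and the residual vanishes.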

1.2 Residual Prediction with Data Mining Techniques

We also need to select a model to predict the residual signal. We use standard data mining
techniques, namely least squares, random forest, elastic net regression and multinomial
logistic regression.
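As a sketch of how these four predictors might be set up (scikit-learn assumed; the feature construction, hyperparameters and function name are our assumptions, not from the paper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet, LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def fit_residual_models(X, y):
    """Assumed setup: X holds lagged residual features, y the next-step
    residual. Returns the four fitted predictors compared in section 3."""
    models = {
        "least_square": LinearRegression(),
        "elastic_net": ElasticNet(alpha=0.1),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    }
    for m in models.values():
        m.fit(X, y)
    # the logistic model classifies the sign of the next residual instead
    clf = LogisticRegression()
    clf.fit(X, np.sign(y).astype(int))
    models["logistic"] = clf
    return models
```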

1.3 Main Challenges

- A first challenge, built in with Random Matrix Theory, is that we have many zero
returns at smaller time differences. This makes it hard to
compute the SVD that we need for our eigenvalues/eigenvectors.
- The residual signals we compute are very small and are easily perturbed by numerical noise.
- The biggest challenge is that the residual signals are sensitive to the \sigma^2 we choose for the
distribution of eigenvalues.
- There is an inverse relationship between time cost and parameter tuning. Ideally, we
want to tune more parameters and get more stable, accurate results, but we then spend
more time tuning them, which is not ideal when executing quickly.

2 Trading Strategy

2.1 Parameter Estimation

The walk-forward optimization (WFO) methodology is used to update the choice of
the variance of the elements in the standardized returns matrix. This variance is used to
compute the eigenvalue spectrum of the empirical correlation matrix. The performance of
the parameter to be optimized is judged a posteriori in terms of the robustness or
stability of the obtained optimal parameter maximizing a certain objective function. In this
report, we choose the Sharpe ratio as our objective function. The classical WFO algorithm
is used: we start by building our model on an initial amount of data satisfying the
condition T/N > 1. Using the optimal model obtained on this initial period of data, we make our
first out-of-sample predictions. After the out-of-sample prediction period ends, this segment
of data is added to our in-sample database and we build another predictive model with a
different \sigma^2. This allows us to update our model and account for any non-stationarity or
new information in the process. It should be noted that it is essential to find a good, stable
optimization procedure in order to fit the parameters used in modeling the mean-reversion
process.
As we said, the optimization is performed using the Sharpe ratio as an objective function,
starting with a 7M$ investment (10K$) and following a buy/short trading strategy on the
70 stocks in the XLK technology ETF. The first predictive model is built using data from
the first 191 minutes of the trading day, which gives Q = T/N ≈ 2.73, thus satisfying the
conditions allowing us to apply equations (10) and (11). This model is used to predict the
next period, consisting of 120 minutes. When the 120 minutes are over, they are added to our
sample and used on top of the initial data to build a new predictive model. This process is
repeated until the end of the day. Using periods of size 120, we end up building two different
models each day, each having a distinct value of \sigma^2. From a preliminary analysis it seems that the
choice of the length of the training periods is crucial for a correct parametrization of \sigma^2:
if you choose a large number of minutes you run the risk of over-fitting, whereas
if you choose a small sample the statistical significance can greatly deteriorate.

Figure 1: Walk Forward Optimization
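The walk-forward loop described above (191 in-sample minutes, then 120-minute out-of-sample blocks) can be sketched as follows; the callback interface and the grid of \sigma^2 candidates are our assumptions (NumPy assumed):

```python
import numpy as np

def sharpe(returns):
    """Objective function: Sharpe ratio of a minute-by-minute P&L series."""
    return returns.mean() / returns.std(ddof=1)

def walk_forward(pnl_for_sigma2, day_minutes=431, init_train=191, step=120,
                 grid=(0.6, 0.8, 1.0, 1.2)):
    """pnl_for_sigma2(sigma2, start, end) is an assumed callback returning the
    strategy P&L over minutes [start, end) when signals are built with variance
    parameter sigma2. Each round picks the sigma2 maximizing the in-sample
    Sharpe ratio, then trades the next `step` minutes out of sample."""
    out = []
    train_end = init_train
    while train_end + step <= day_minutes:
        best = max(grid, key=lambda s2: sharpe(pnl_for_sigma2(s2, 0, train_end)))
        out.append(pnl_for_sigma2(best, train_end, train_end + step))
        train_end += step          # out-of-sample block joins the in-sample set
    return np.concatenate(out) if out else np.array([])
```

With a 431-minute day this produces exactly the two 120-minute out-of-sample models per day described above.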

Figure 2: Plot for the 25th of February of the Sharpe ratio versus the \sigma^2 used to compute the trading signals.

Figure 3: Plot for the 24th of February of the Sharpe ratio versus the \sigma^2 used to compute the trading signals.

Figures 2 and 3 show the variation of the Sharpe ratio with respect to \sigma^2 for both periods
during a given day. We show the results for two days: on the first, profits were
realized, and on the second, losses were incurred. The idea is to understand whether the
stability or robustness of the optimization procedure has any impact on the profits and
losses. Indeed, for the profitable day we can see in Figure 2 that there is a cluster of \sigma^2
values achieving a high, positive Sharpe ratio and a cluster achieving negative values, for both
considered periods. It is interesting to note that the positive cluster was around
\sigma^2 \in [0.75, 0.85], in accordance with the results obtained by Laloux et al. [Lal+99]. Indeed,
this could explain the good results achieved by the raw-residuals approach in the accumulated
wealth plot for the 25th of February data. On the other hand, losses were incurred
on the 24th of February data when using the same signal-generation approach. Looking at
the graph of \sigma^2 versus the Sharpe ratio again, we see that the cluster of good values of \sigma^2
is absent and that the objective function is highly oscillatory. This should be considered
a warning that the methodology might be over-fitting the available data. A possible future
extension of this work is to test the hypothesis that the absence
of a good cluster of \sigma^2 values can be considered evidence against the predictive
power of the model, and thus an indication of a possible bearish day.

2.2 Signal Filtering Techniques

The signal is obtained from the residual of section 1. The signal to buy a stock is
a positive residual, and the signal to short sell a stock is a negative residual. However, the residual may contain noise, and we may end up with a non-market-neutral strategy. Thus, to achieve a more accurate signal and trade market-neutrally, we filter the residuals before transforming them into
signals. We provide an IPython notebook demonstrating this part and the WFO at
https://github.com/mikemeetoo/mse448.

Figure 4: Signal Filtering Techniques

2.2.1 Residual Filtering

Once we obtain the residuals, we want to extract the strong signal from them; therefore, we
ignore residuals of low magnitude by setting them to zero. We pick a ratio \alpha to
filter out and set up the banners as

    positive banner = \alpha \cdot max(positive residuals) + (1 - \alpha) \cdot min(positive residuals)
    negative banner = \alpha \cdot min(negative residuals) + (1 - \alpha) \cdot max(negative residuals)

Residuals that lie between the two banners are filtered out by setting them to zero.
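A sketch of this filter under our reading of the banner formulas, which are garbled in the original (NumPy assumed, function name ours):

```python
import numpy as np

def banner_filter(residual, alpha):
    """Zero out low-magnitude residuals. The banner interpolates between the
    extreme values of the positive and of the negative residuals; residuals
    between the two banners are set to zero (section 2.2.1)."""
    res = residual.copy()
    pos, neg = res[res > 0], res[res < 0]
    if pos.size:
        pos_banner = alpha * pos.max() + (1 - alpha) * pos.min()
        res[(res > 0) & (res < pos_banner)] = 0.0
    if neg.size:
        neg_banner = alpha * neg.min() + (1 - alpha) * neg.max()
        res[(res < 0) & (res > neg_banner)] = 0.0
    return res
```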
2.2.2 Active Stock Filtering

We also consider only stocks that are actively traded, by considering the ratio of zeros in the return
data. If the return series of a stock contains a higher ratio of zeros than the threshold \beta, we
do not consider that stock in our analysis.
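A minimal sketch of this filter (NumPy assumed; `beta` is the threshold above, the function name is ours):

```python
import numpy as np

def active_stocks(R, beta):
    """Boolean mask over the N stocks of the N x M returns matrix R, keeping
    only stocks whose fraction of zero returns is at most beta (section 2.2.2)."""
    zero_frac = (R == 0).mean(axis=1)
    return zero_frac <= beta
```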
2.2.3 Sorting Residuals

The goal of this step is to obtain a market-neutral portfolio. We sort the
filtered residuals and count the number of positive and negative residuals.
Then we set the lowest-magnitude residuals to zero until the numbers of positive and negative
residuals are equal.
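This balancing step, together with the sign rule of section 2.2.4, can be sketched as follows (NumPy assumed, function name ours):

```python
import numpy as np

def market_neutral_signal(residual):
    """Zero the smallest-magnitude residuals on the heavier side until the long
    and short legs have equal size (section 2.2.3), then emit +1/-1/0 signals
    from the residual signs (section 2.2.4)."""
    res = residual.copy()
    n_pos, n_neg = (res > 0).sum(), (res < 0).sum()
    excess = abs(n_pos - n_neg)
    if excess:
        side = res > 0 if n_pos > n_neg else res < 0
        idx = np.where(side)[0]
        drop = idx[np.argsort(np.abs(res[idx]))[:excess]]
        res[drop] = 0.0
    return np.sign(res).astype(int)
```

By construction the resulting signal vector has as many +1 entries as -1 entries, so the book is market-neutral.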
2.2.4 Generating the Signal

After all the methods described above, we generate the signal by considering the sign of the
residual: if it is positive we emit the signal +1 to buy the stock, and if it is negative we emit the signal
-1 to sell the stock.

3 Results

3.1 Simulation Settings

We are provided with high-frequency data by Thesys from Feb 24, 2014 to Feb 28, 2014.
Based on this data set, we conducted simulations to tune and evaluate the different methods.
There are two kinds of simulations with different settings. The first is for comparisons across the different methods proposed above. Daily profit (investment returns before
and after the last minute of the day), computational cost (inner-loop time), and minimum and
maximum wealth are evaluated and compared.
The parameters are selected by tuning in the first part of the simulations comparing
the different methods. We used the filtering parameter \alpha = 0.3 (except for logistic regression,
which is not stable with respect to the filtering parameter, so we perform logistic
regression without filtering) and the active-stock parameter \beta = 0.5 for fair comparisons across
all methods.
The second kind is for the comparison of different parameters in the parameter tuning,
using only raw residual signals.

3.2 Simulation Results

First of all, our simulation compares across the different methods and shows that there is some unstable factor in the last minute of the net wealth we invested (due to dumping all the
positions).
Secondly, we demonstrate the computational cost for the different methods, which can guide
the feasibility of implementation in high-frequency trading (shown in Table 1 (a) below).
Among all the methods, logistic regression costs from 60 to 170 milliseconds in each inner
loop, random forest costs from 130 to 280 milliseconds, elastic net costs 50 to 200 milliseconds, least squares costs 16 to 115 milliseconds, while the raw residual signal costs 5 to
15 milliseconds. Thus the raw residual signal and least-squares-based methods are the most
efficient, and random forest is the most time-consuming. However, all the proposed and
developed methods can finish evaluation, prediction and execution within 300 milliseconds,
which makes them feasible for high-frequency trading.
Thirdly, across all the models we used, we cannot identify one model that is always profitable. Daily profits for each method are shown in Table 1 (b) and Table 1 (c) below. The
highest daily profit is achieved using elastic net, while the largest daily loss is incurred using logistic regression, and least squares tends to result in small profits or losses. We need
to perform more back-testing to further discover the detailed behavior of each method
and decide which method to implement under different settings. But in general we are
optimistic about the results we get, since we do not see any model that should be discarded.
Last but not least, the proposed methods are robust in terms of daily profit. In the
raw residual simulations, if we choose the variance parameter for the eigenvalue distribution well, we
are profitable on every day of the given data.

Strategy              Min Inner-Loop Time (ms)   Max Inner-Loop Time (ms)
Raw Residual          5                          15
Least Square          16                         115
Logistic Regression   60                         170
Elastic Net           60                         200
Random Forest         130                        280

(a) Inner-loop time cost for different methods

Strategy              1st Day     2nd Day     3rd Day     4th Day     5th Day
Elastic Net           4991.895    1207.83     -1419.98    7315.945    -5374.125
Least Square          -1426.335   -1543.56    -5180.005   3441.225    -1155.935
Random Forest         3363.33     1465.835    -300.325    4704.34     -4259.67
Logistic Regression   4213.545    3247.285    -11122.88   2200.35     -6603.495
Raw Residual          -3673.71    -1137.92    5296.745    -3831.775   3330.235

(b) Investment returns the minute before the last minute of the day

Strategy              1st Day     2nd Day     3rd Day     4th Day     5th Day
Elastic Net           6223.68     3510.77     -2142.92    7410.985    -5079.255
Least Square          934.48      -208.865    -1159.76    3214.12     -539.23
Random Forest         1979.275    3693.96     -1108.035   6212.715    -8128.54
Logistic Regression   2755.92     3037.53     -7647.885   250.56      -7885.815
Raw Residual          -2570.84    231.56      5375.98     -4056.205   3611.49

(c) Investment returns on the last minute of the day

Table 1: Performance comparison of each strategy

4 Conclusion

As can be seen from Table 1, we are able to generate positive (P) returns on any of the days
considered, though this ability to make a profit depends strongly on the chosen optimization
method. This implies that we are just as likely to generate negative (N) as positive returns
on any given day. A closer look at the table reveals that no single method is able to
generate profit on more than four days out of the five considered; thus, under the study
undertaken, relying on a single strategy seems unwise and too risky. This suggests that
in a real-case scenario, in order to correctly implement a winning HFT stat-arb strategy, we
would have to correctly choose which strategy to use out of the five studied and, if more than
one is chosen, what weight to assign to each in order to maximize returns while minimizing risk. A
first possibility could simply be choosing a static optimal weight for each of the presented
optimization strategies, though from a cursory look at Table 1 it would seem better
to limit future analysis to strategies 1-3-5 or 3-4-5, since on any given day at least two of
those strategies give positive returns. Alternatively, a continuously updating weighting process
could be applied: as Figures 5 through 11 show, there seem to be clear daily trends depending
on the optimization strategy chosen. This feature could be exploited to maximize return
by increasing the amount of money invested through a single strategy throughout the day
as it makes a profit, while filtering out the negative effect of negative-return strategies.
In summary, with mindful consideration of a correct computation of our book signal as
well as a careful implementation of our optimization process, the results presented so far
show the feasibility of implementing an HFT stat-arb strategy.

References

[AL10]   Marco Avellaneda and Jeong-Hyun Lee. "Statistical arbitrage in the US equities
         market". In: Quantitative Finance 10.7 (2010), pp. 761-782.

[Lal+99] Laurent Laloux et al. "Noise dressing of financial correlation matrices". In:
         Physical Review Letters 83.7 (1999), p. 1467.

A Appendix

[Figures 5 through 11 each show five panels, (a) Day 24 through (e) Day 28, plotting wealth versus time for the simulation of a 7M$ investment.]

Figure 5: Raw Residuals

Figure 6: Raw Residuals after filtering with \alpha = 0.3

Figure 7: Elastic Net

Figure 8: Least Square

Figure 9: Random Forest

Figure 10: Random Forest after filtering with \alpha = 0.3

Figure 11: Logistic Regression