
How to Avoid Overfitting Using Robustness Tests

Whitepaper by TipToeHippo.com

Abstract
When developing algorithmic trading systems, one of the primary considerations is what is
commonly called ‘overfitting’: when the predictive algorithm finds profits in random noise
rather than in repeatable patterns. This whitepaper outlines statistical methods which can
be used to identify the systems most likely to have discovered repeatable patterns, thereby
increasing a developer's chance of finding a system which will prove profitable into the
future. The effectiveness of the different methods is tested statistically, both individually
and collectively, showing why algorithmic developers must use them if they want to maximise
their chance of successful systems.
Contents

1 Introduction
2 How to create a robust individual system
  2.1 Definition of Robust
  2.2 Avoiding Overfitting
3 Techniques to Avoid Overfitting
  3.1 Hold-out Data
  3.2 Robustness of Parameter Values
  3.3 Walk-Forward Optimisation
  3.4 Monte Carlo Permutation Analysis
  3.5 Similar Market Stability Checks
4 Results
  4.1 Positive Effects of the Robustness Tests
5 Conclusion

Contributors
Along with the collective members of TipToeHippo, we acknowledge the contributions to this
topic by Dave Walton, Dr Timothy Masters and Dr Ernest Chan.
We stand on the shoulders of giants.

Version 2.0 | Oct 2021



1 Introduction

The problem facing algorithmic traders is not discovering systems that perform well on training
data, but discovering systems that work on unseen, future data. It is perilously easy to over-optimise
and ‘overfit’ a trading system - whether by introducing unnecessary parameters, tweaking system design
based on out-of-sample performance, or not using out-of-sample data at all, overfitting is a common
pitfall for both new and experienced algorithmic traders. This paper is designed as a broad summary
of our research and findings on overcoming the problem of overfitting in financial markets,
as well as a primer for algorithmic traders looking to broaden their understanding of the topic.
Once the problem of overfitting is defined and understood, we discuss techniques for its avoidance.
We explain the importance of true, never-before-seen hold-out data, and how it helps select the
strongest, most robust systems. In addition to a hold-out period, we detail a number of other
statistical tests that can be used to gauge the robustness of the created systems.
Parameter tests should be run to ensure that slight changes in the market do not cause large
changes in the outcome of the system. The market data, and other trading variables such as transaction
costs, should be manipulated to ensure that the system can withstand the worst market conditions.
The system must also perform on similar markets.
TipToeHippo has profitable trading algorithms, developed using rigorous mathematical and
statistical methods. It is hoped that others wishing to emulate TipToeHippo's successful
discovery of robust trading systems will gain an overview of the topic and use the knowledge
in this paper as a springboard into deeper research and understanding. We present our own
conclusions on the effectiveness of different robustness tests in improving the quality of
selected systems, and show that, with their proper use, we see significant improvements
in the live performance of these systems.

Figure 1. TipToeHippo.com results - Development to Forward Test to Live Trading

2 How to create a robust individual system

Perfect is the enemy of good.

VOLTAIRE

2.1 Definition of Robust

Robust - Adjective:

1. Having or exhibiting strength or vigor

2. Strongly formed or constructed

3. Capable of performing without failure under a wide range of conditions

When completed, an algorithm must be trusted to perform in an unseen future. The challenges
of the future cannot be known, and so the system must be strongly created and stress-tested on a
diverse variety of market situations so that its chance of future failure is low.
To trade a system, the system must be trusted. To trust a system, it must be robust.

2.2 Avoiding Overfitting

When developing an algorithmic trading system, overfitting is probably the most dangerous, and
certainly one of the most difficult, problems to overcome. It is dangerous because, counter-intuitively,
the better your system appears on its development data, the worse it is likely to perform on unseen
or live market data. The past is never like the future in financial markets. What has come in the
past will not repeat exactly into the future; markets might have similarities to the past, but never
exactness. This is the key: we want the algorithmic system to recognise general similarities to the
past and profit from them in a general way.

I would rather be generally right than precisely wrong - J.M. Keynes

If a system is developed and fine-tuned to perform to its maximum potential over the development
data it will develop very specific methods of trading to maximise those desirable values that the
developer wants, and minimise those conditions which are unwanted.
This sounds like a good approach, and that is exactly why overfitting is dangerous - because in
maximising and minimising results over the test period, what seems like improvement is just increasing
specialisation on a market that is now past. The robot is very good at knowing how to trade a market
it has analysed millions, if not billions, of times - but it struggles to apply those exact rules in a
market it has never encountered.

Figure 2. An example of an overfitted system, falling apart after near-perfect performance.
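This failure mode can be sketched in a few lines. The following minimal, hypothetical illustration (random data, NumPy assumed; not TipToeHippo's actual development tooling) fits a high-degree polynomial "system" to a pure random walk: it scores brilliantly on the window it has seen and collapses on the window it has not.

```python
import numpy as np

rng = np.random.default_rng(7)

# A pure-noise "price" series: a random walk with no repeatable pattern.
prices = np.cumsum(rng.normal(0.0, 1.0, 300))
t = np.arange(300.0)

# Development data (seen) vs unseen forward data.
train_t, train_p = t[:200], prices[:200]
test_t, test_p = t[200:], prices[200:]

# An overfitted "system": a high-degree polynomial tuned to the training window.
model = np.poly1d(np.polyfit(train_t, train_p, deg=15))

in_sample_mse = float(np.mean((model(train_t) - train_p) ** 2))
out_sample_mse = float(np.mean((model(test_t) - test_p) ** 2))

# The fit looks excellent on data it has seen, and explodes on data it has not.
print(in_sample_mse, out_sample_mse)
```

Any sufficiently flexible model will show the same pattern on noise: the out-of-sample error dwarfs the in-sample error, because the "patterns" learnt were never real.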
An example in another setting is a mathematics exam. Students can learn how to find the
answer to extremely complex mathematical equations well beyond the expected knowledge for
their grade. The teacher can have them practise these very complex questions over and over until
they perform perfectly, obtaining the correct answer quickly; this is commonly known as rote learning.
Come exam time, simpler maths problems are given on the exam, but because the students
have only ever practised rote-learnt complex problems, they are unable to apply common
mathematical rules and principles to obtain the answers.
Computers want to rote learn markets to maximise their performance. But computers cannot rote
learn future markets, because those markets are yet to form - so they fail to perform on markets
they have not seen, just like the students faced with unseen maths problems in the example above.
In his book The Encyclopedia of Technical Market Indicators, Colby outlines a system which
he claims proves that the reader can easily create profitable systems. As proof,
he creates a system within the book, tested up until the year 2000 (the book was released in 2002).
However, examining that system now, after running it through the markets from 2000-2015, shows a
significant loss of performance.
The system has become dangerously unprofitable immediately following the end of its training
period. This kind of reduction in performance is what must be avoided.

3 Techniques to Avoid Overfitting

3.1 Hold-out Data

The term ‘out-of-sample data’ has different definitions depending on the source. Sometimes
it refers to test data or non-training data, other times it is mixed in with validation data, and
frequently it is used in a way that isn't truly out-of-sample. To
avoid ambiguity, here we refer to single-run out-of-sample data as ‘hold-out data’.
When training the model, use one period of data. Then, when testing the validity of what has
been created, use a separate, never-used second period of data. For example, train and develop
using data from 1/2011 to 12/2018, using that data over and over again for all testing.
Only once development and testing are complete, and the system is the most robust and best-performing
possible, is the system run once on the period 1/2019 to 10/2021.
This test on the never-seen hold-out data gives a statistically unbiased estimate of how the
system will perform in the future. It is effectively a time machine: the system can be tested as
if on a demo account for the hold-out period. In our example, more than two years of live demo
testing can be conducted instantaneously. This result is an excellent indicator of how the system
will perform going forward, especially if the hold-out results statistically match the training results.
Such consistency demonstrates that the system is trading underlying patterns that continue to
occur within the market, rather than noise it fitted to in the past.
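As a minimal sketch of the split described above (pandas assumed; the dates mirror the worked example, while the price column is a placeholder), the key discipline is that the two windows never overlap:

```python
import pandas as pd

# Hypothetical daily price history covering the full development horizon.
dates = pd.date_range("2011-01-01", "2021-10-31", freq="D")
data = pd.DataFrame({"close": range(len(dates))}, index=dates)

# Training window: used over and over again during system development.
train = data.loc["2011-01-01":"2018-12-31"]

# Hold-out window: touched exactly once, as the final judgement of the system.
hold_out = data.loc["2019-01-01":]

# Guard against look-ahead leakage between the two windows.
assert train.index.max() < hold_out.index.min()
print(len(train), len(hold_out))
```

The code enforces mechanically what must also be enforced behaviourally: no result from the hold-out window may ever feed back into development.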
The hold-out result is due to two factors: the underlying trading of repeatable patterns, and
luck (both good and bad). The longer the hold-out period, the greater its statistical significance. In
a short hold-out period, the system may be lucky and give results better than can be expected
going forward, purely due to random chance. In a long hold-out period, good luck is likely to be
balanced by periods of bad luck. An even spread of good and bad luck over an extended period of
time reduces the overall effect of luck in the hold-out results, leaving mainly the results from
the underlying repeatable patterns. This gives greater certainty that the system will continue
to work into the future.
It is important to run this test only once, as a final test. Should it not meet expectations,
the whole system and all its development should be binned and the entire development process
restarted from the beginning. The point of the hold-out data is to judge the system, not to improve
it. If the system is improved using this hold-out data, it is, oxymoronically, no longer hold-out data
but training data, and the ‘improvements’ are merely overfitting. We believe that the proper use of
hold-out data is the best tool available for system developers to test the robustness of their systems,
and as much time as possible should be set aside for it. It is fortunate that the 2019-21 period
contains some dramatic markets during the pandemic, giving the systems an excellent robustness
check.
This data is precious; it can only be used once. The quality of the system must be very high
before it is run on this hold-out data, so every other test listed below in this paper should be
conducted first. Be as sure as possible that this is the system, because there is only one shot
at the hold-out data. Once that shot is used, the system either goes live - or goes into the bin.
A recommended author on this topic is Dr. Ernest P. Chan. In his second book ‘Algorithmic
Trading’ he writes the excellent summary quoted below:

The way to detect data-snooping bias is well known: We should test the model on out-of-sample
data and reject a model that doesn’t pass the out-of-sample test. But this is easier said than done.
Are we really willing to give up on possibly weeks of work and toss out the model completely?
Few of us are blessed with such decisiveness. Many of us will instead tweak the model this way
or that so that it finally performs reasonably well on both the in-sample and the out-of-sample
result. But voila! By doing this we have just turned the out-of-sample data into in-sample data.

Dr Ernest P Chan

3.2 Robustness of Parameter Values

It is common to find parameter values for a trading system which appear to be optimal, only for
them to be sub-optimal when trading into the future. We find it useful to test not only for optimal
values of parameters, but for optimal areas. To explain this, take an example system that uses an
ATR indicator, with an adjustable period length set to 14. In an ideal system, we want the system
to perform generally well with the ATR value of 14 as it would on the nearby values of 10 and 18.
A dramatic shift in the success of a system from a small change in a parameter value should be a
warning sign that this parameter, and likely the entire system, is overfitting.

Figure 3. Stable area parameter selection


The system does not need to perform equally well on all surrounding parameter values, but it
must be tested to see whether a change in variable values causes a complete collapse in the profitability
of the system. The basis for our statistical testing is outlined in Know Your System! - Turning Data
Mining from Bias to Benefit Through System Parameter Permutation by Dave Walton.
The most robust systems return not the highest profit, but an area of variable values that provides
the most stable outcomes as those values are slightly changed. Thought of as a
topographical map, the optimal outcome is not the highest individual peak but the largest
area of hills. By stress-testing each parameter of our system, we reduce the likelihood that an
indicator value or threshold has been discovered by chance. A surrounding range of
values producing similar levels of profitability is further assurance that the system will continue
to perform into the future. If performance falls away quickly from a high individual peak, then
the system is overfitted and not robust, and should be rejected.
Each system developed by TipToeHippo has its variables run through thousands of different
permutations to ensure that the selected parameters remain profitable when moved away from their
chosen values.
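A simple version of this neighbourhood check can be sketched as below (the function names, radius and 50% collapse threshold are illustrative assumptions, not TipToeHippo's proprietary criteria):

```python
def parameter_stability(backtest, centre, radius=4, max_drop=0.5):
    """Check that profit does not collapse for parameter values near `centre`.

    `backtest` is any function mapping a parameter value (e.g. an ATR period)
    to a profit figure. Returns True only if every neighbour retains at least
    (1 - max_drop) of the centre value's profit.
    """
    base = backtest(centre)
    if base <= 0:
        return False
    for p in range(centre - radius, centre + radius + 1):
        if backtest(p) < (1 - max_drop) * base:
            return False
    return True

# A hypothetical profit surface with a broad, stable hill around 14 ...
stable = parameter_stability(lambda p: 100 - (p - 14) ** 2, centre=14)
# ... versus a narrow, lucky spike at 14 that collapses one step away.
spiky = parameter_stability(lambda p: 100 if p == 14 else 5, centre=14)
print(stable, spiky)  # True False
```

In the topographical-map analogy, the first surface is a wide hill and passes; the second is a lone spike discovered by chance and is rejected.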

3.3 Walk-Forward Optimisation

Walk-Forward Optimisation (WFO) is a robustness-testing method in which the built system's parameters
are tested using a rolling optimisation-and-backtest pattern, compiling results in which the parameters
are always evaluated on non-optimised data.

Figure 4. Stages of Walk-Forward Optimisation

Figure 4 gives a visual example of Walk-Forward Optimisation. Note how there is a period of
optimisation of the system's parameters, followed by a backtest period. By rolling this pattern
forward through time, a series of backtests is created with parameter values which have not been
optimised on the period they are backtested on.
It is important to remember that the system has been developed upon the same data that the
WFO is conducted over. So WFO does not create truly unseen data, nor does it turn training data
into out-of-sample data. What WFO does do is give an indication of how the parameters react to
the optimisation and backtest sequence. If the walk-forward optimised parameters perform well
across the backtest, the system is shown to be robust. It gives confidence that the system is strong
enough to continue to work in the unknown future using parameters on data it was not optimised
on.


The important thing to realise is that this test evaluates the sensitivity of the parameters to failure
on non-optimised data. It is not a magic bullet that solves overfitting - the system is still built on the
same training data that the WFO uses. What WFO demonstrates is that the system can operate successfully
on forward data for which its parameters were not optimised. The historical data is broken into
multiple time periods; in each period the system is optimised, before being run through the
next section with the newly optimised parameters.
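The rolling sequence just described can be sketched as follows (a generic skeleton under assumed window lengths; the `optimise` and `backtest` callables stand in for whatever fitting and evaluation a real system uses):

```python
def walk_forward(data, optimise, backtest, opt_len=500, test_len=125):
    """Rolling walk-forward: optimise on one window, backtest on the next.

    `optimise(window)` returns the best parameters for that window;
    `backtest(window, params)` returns the result of trading the window
    with parameters that were never optimised on it.
    """
    results = []
    start = 0
    while start + opt_len + test_len <= len(data):
        opt_window = data[start : start + opt_len]
        test_window = data[start + opt_len : start + opt_len + test_len]
        params = optimise(opt_window)
        results.append(backtest(test_window, params))
        start += test_len  # roll forward by one backtest period
    return results

# Toy usage: "optimise" picks the window mean, "backtest" counts bars above it.
data = list(range(2000))
runs = walk_forward(
    data,
    optimise=lambda w: sum(w) / len(w),
    backtest=lambda w, p: sum(1 for x in w if x > p),
)
print(len(runs))  # number of walk-forward steps
```

Stitching the `results` together gives the out-of-optimisation equity curve that the WFO verdict is based on.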
We view WFO as a valuable test of robustness. The market is constantly changing, including
the market noise. WFO tests that the system is able to adapt to these changing conditions – trading
the underlying market patterns while avoiding getting faked out by noise. A strong performance in
WFO gives encouragement that the system will continue to perform in the future.

Figure 5. Walk-Forward Matrix (WFM) result

TipToeHippo runs a matrix of 152 individual WFO tests using different optimisation and backtest
periods. Figure 5 shows the matrix results of a system currently trading in TipToeHippo's
Index Portfolio. Of the 152 WFO tests conducted, 139 gave consistent results, demonstrating
that the system is robust and expected to continue to perform into the future. Our program takes
proprietary performance measurements of each run and displays a ‘pass’ or ‘fail’ outcome when
compared to the original system's results.


3.4 Monte Carlo Permutation Analysis

Monte Carlo analysis takes the financial data that the system was developed on and modifies
it so that it is slightly, but significantly, different from the original. What Monte Carlo does is
remove the ‘luck’ or ‘noise trading’ factor from the results. Markets are very noisy: an important
economic announcement which moves the market a great deal is in fact regarded as noise, because
it is not a recurring pattern. If a system is too powerful and ‘learns’ to trade these big but noisy
market movements, it is effectively treating the noise as a legitimate repeating pattern. Because these
patterns are actually noise, they don't repeat, so the trading system will get faked out when,
in the future, it recognises similar noise that does not play out as a repeatable pattern. That is what
overfitting is: the learning of unrepeating patterns, causing fake-outs in the future.

Figure 6. Results of a Monte Carlo Permutation Test

Monte Carlo tests whether this is a problem by manipulating the price data - changing the ‘noise’. The
volatility is raised and lowered, the highs and lows of each bar are manipulated, and the spread is raised.
Entries are also delayed or brought forward; this changes the entry point for systems, and a robust
system should not be highly dependent on exact trade entry times. The same is done for exits. If the
system performs within expectations compared to the unadulterated data, it is considered robust and
not overfitted: the data has been changed, but the performance remains strong.


TipToeHippo performs the following Monte Carlo tests, each with 1000 repetitions:

• Market Price Data

• System Parameter Values

• Trade History

• Transaction Costs

• Trade Order

Each system is stress-tested in each of these areas to ensure that it remains profitable over
different permutations of the variables involved in trading.
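The trade-order test in the list above is the simplest to sketch. The following hypothetical implementation (standard library only; the trade history and run count are illustrative) reshuffles a fixed set of trade results many times and records the worst drawdown of each reshuffled equity curve:

```python
import random

def monte_carlo_trade_order(trade_returns, n_runs=1000, seed=1):
    """Shuffle the order of a system's trade results many times and record
    the worst equity drawdown of each shuffled sequence.

    Total profit is unchanged by reordering; what varies is the path. The
    distribution of drawdowns shows how much of the original equity curve's
    smoothness was luck of trade sequencing.
    """
    rng = random.Random(seed)
    drawdowns = []
    for _ in range(n_runs):
        trades = trade_returns[:]
        rng.shuffle(trades)
        equity, peak, worst = 0.0, 0.0, 0.0
        for r in trades:
            equity += r
            peak = max(peak, equity)
            worst = min(worst, equity - peak)
        drawdowns.append(worst)
    return drawdowns

# Hypothetical trade history: mostly small wins with occasional losses.
history = [10, -5, 8, -12, 15, 7, -6, 9, -4, 11] * 20
dds = monte_carlo_trade_order(history)
print(min(dds), max(dds))  # worst-case and best-case reshuffled drawdowns
```

If the original drawdown sits comfortably inside this distribution, the result is within expectation; if the real curve was far smoother than almost every shuffle, the sequencing itself was lucky.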

3.5 Similar Market Stability Checks

If a system is developed on the Dow Jones (US30) and passes the robustness tests described above-
then it should also perform on a similar, but different market like the DAX (GER30). Every system
is dependent on the developed parameters. It is a product of the data it has been developed over-
to trade that financial market in a profitable way.
TipToeHippo places great emphasis on ensuring that a system is not overly dependent on having
known financial data to be profitable. The future is unknown, so in an effort to ensure our systems
will perform in this future unknown data we conduct multiple tests. Only once a system passes
all of those tests do we then consider running it on hold-out data. We expect a system that works
on an index to be profitable on three separate indices and the AUDUSD currency pair for it to be
considered robust.
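This acceptance rule can be sketched as a simple gate (the symbol list, profit figures and pass criterion here are illustrative assumptions, not a fixed TipToeHippo specification):

```python
def passes_similar_markets(backtest, symbols=("US30", "GER30", "NAS100", "AUDUSD")):
    """Require the system to stay profitable on every related market,
    not just the one it was developed on.

    `backtest(symbol)` returns net profit for that market.
    """
    return all(backtest(symbol) > 0 for symbol in symbols)

# Hypothetical results: profitable on the indices but losing on AUDUSD.
results = {"US30": 5200, "GER30": 3100, "NAS100": 1900, "AUDUSD": -400}
print(passes_similar_markets(results.get))  # False - the system is rejected
```

One losing market is enough to fail the gate, reflecting the all-or-nothing stance taken throughout this section.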

4 Results

The theoretical reasons for conducting various robustness tests have been discussed. Below in
Figures 7 & 8 we present the statistical evidence for the robustness tests’ effectiveness.

4.1 Positive Effects of the Robustness Tests

These figures show the improvement in the percentage of systems which survive the hold-out data.
The salient points are as follows:

• Without any robustness tests, 46% of generated US30 systems passed hold-out in profit.

• After robustness tests, 87% of the remaining systems passed the hold-out test.

• Individually, the robustness tests provide a relatively small improvement to the hold-out pass
rate.

• However, when combined in series, we see a near doubling of hold-out pass rate from 45.7% to
86.8%.

Figure 7. DOW (US30) Hold-out Improvement Metrics


These results are generally replicated when the same tests are conducted on the DAX.

Figure 8. DAX (GER30) Hold-out Improvement Metrics

These results show the importance of using multiple robustness tests in a system development
process. Individual tests help improve the pass rate, but when combined the robustness
tests massively improve the quality of the remaining systems. Because each test is different
in nature, they can intuitively be thought of as stress-testing different parts of the system.
If any individual part fails, the system is not robust and is discarded. If a system passes
all of the different tests, however, it is as robust as possible.
Before a system is selected as a candidate, we ensure it passes all these tests. When selecting
our champion candidate, we want only the strongest, best-performing and most robust contender. When
a strong, robust system passes hold-out, we have every confidence it will continue to perform well
into the future.

5 Conclusion

This paper has outlined various methods of robustness-testing algorithmic trading systems.
Each of these tests serves the purpose of stress-testing a different aspect of a system, and
combined in series they dramatically help overcome the critical algo-trading problem of
‘overfitting’.
By respecting the importance of true, never-before-seen hold-out data, an accurate portrayal
of how the system would have performed on unseen forward data is achieved. Using multiple
robustness tests in series ensures that systems not only perform well on training data, but also
continue to perform well on unseen and future live data. These ideas are not only true in a theoretical
sense, but are also shown to be statistically valid when applied to a pool of randomly generated
systems. Traders developing algorithmic systems who strictly follow a robustness
‘roadmap’ will dramatically increase their chances of finding long-term success.
When TipToeHippo conducts correct robustness testing, we gain a massive improvement in the
quality of our pool of systems. These high quality candidates allow us to select the individual system
to forward test over the hold-out data with a high degree of confidence. We are confident that not
only will the system perform profitably in the hold-out forward test, but it will also continue to
perform consistently well when trading live on our portfolio into the future.
With robustness testing and correct use of hold-out data, TipToeHippo has achieved consistent
results from Development to Forward Test to Live Trading (Figure 9).

Figure 9. TipToeHippo.com results - Development to Forward Test to Live Trading

