Do Not Fall into These Financial Back-Testing Traps

Do not be Misled by these Errors When Back-testing your Trading Strategy.
Sofien Kaabar Oct 14, 2020 · 14 min read
Back-testing is your proof of work for any strategy you might think of. If the trend is the
trader’s friend, then back-testing is his best friend. Many traders simply accept the
mainstream strategies that are already out there and use them in the hope of becoming
rich.
A better framework is to gather ideas from different sources, build on them, combine
them, add a touch of your own creativity, and create a strategy that fits your own risk
profile. Trading has a huge research and analysis component, and it should not be
delegated to common knowledge found online. Below are some common back-testing traps
and how to overcome them.
I have just published a new book after the success of New Technical Indicators in
Python. It features a more complete description and addition of complex trading
strategies with a Github page dedicated to the continuously updated code. If you feel
that this interests you, feel free to visit the below link, or if you prefer to buy the PDF
version, you could contact me on Linkedin.
The Book of Trading Strategies

Amazon.com: The Book of Trading Strategies (9798532885707):
Kaabar, Sofien: Books
www.amazon.com
Overfitting and Underfitting

After reading the brilliant work of Dr. Marcos López de Prado on biases a few months
ago, I have come to understand that something always escapes us, no matter how perfect
we think the analysis was.
Unfortunately, overfitting remains very common, even among professional practitioners.
It is a difficult obstacle to overcome, and the only way to limit it is to respect a few
guidelines that are outlined below. But first, we should understand what overfitting is.
Overfitting occurs when a model fits the past values so closely that it captures their
noise and fails to capture the general underlying relationship. Thus, the results will be
good on the training period but disappointing on the testing period.
Underfitting is the opposite of overfitting. It is the equivalent of a model that has not
done its homework to fully understand the data. Together they form what we call the bias
and variance problem, and both are considered “fit” issues:
Bias is associated with underfitting. Simply put, the model encounters a signal and
thinks it is noise.
Variance is associated with overfitting. Simply put, the model encounters noise and
thinks it is a signal.
The problem in real life is that by reducing one, you tend to increase the other; hence,
you are making a trade-off between the two. Finding the right balance between them is
called the Bias-Variance Trade-Off.
How can we fix Overfitting?
Decreasing the model’s complexity by removing layers from its calculation methodology
or by removing variables that may explain the nature of the relationship between the
explanatory variables and the dependent variable.
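Limiting how long the model trains on the data so that it does not fit the training data
exactly. Note that shrinking the training sample itself tends to make overfitting worse,
not better.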
If you are using neural networks, consider a dropout function and an early stopping
technique.
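As an illustration of the early stopping idea mentioned above, here is a minimal sketch. The function name, the `patience` parameter, and the list of validation losses are assumptions made for this example and do not come from any specific library.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    val_losses: validation loss recorded after each epoch (hypothetical input).
    patience:   epochs without improvement tolerated before training stops.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # validation loss stopped improving, so stop training here
    return best_epoch


# Illustrative losses only: they improve at first, then degrade as the model overfits.
losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90, 0.60]
stop_at = early_stopping_epoch(losses, patience=2)
```

Training stops before the late improvement at the last epoch is ever seen, which is the price paid for not chasing the training data too far.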
How can we fix Underfitting?
Increasing the model’s complexity by adding layers to its calculation methodology or
by adding variables that may explain the nature of the relationship between the
explanatory variables and the dependent variable.
Increasing the training period so that the model has more data to work with and can
make predictions based on sufficient information.
Naturally, when we back-test our model, we want it to apply its knowledge to unseen data.
We therefore split our historical data in two. The first set is the training set, which
the model analyzes to learn the relationship. The second (more recent) set is the testing
set, on which the learned relationship is evaluated. In other words, the model learns the
relationship from the training set, expects it to continue, and predicts on the test set.
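The split described above can be sketched in a few lines. The key point is that the data is cut by time and never shuffled. The function name and the placeholder observations are assumptions for illustration; the sizes simply mirror the 2221 training days and 100 testing days used later in this article.

```python
def chronological_split(series, test_size):
    """Split a time-ordered sequence into (train, test) without shuffling."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


observations = list(range(2321))  # placeholder for 2,321 daily observations
train, test = chronological_split(observations, test_size=100)
```

Every element of the test set comes after every element of the training set, so the model never trains on data that is more recent than what it is tested on.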
Summary Table. (Image by Author)
If we want to be technically correct, the only out-of-sample period is the one we haven’t
seen yet (i.e. the future), but for now, we can consider the test set to be out-of-sample.
Overfitting and Underfitting occur during the in-sample period and we see the
disappointing results on the out-of-sample period.
Forgetting Transaction Costs

Something you should always be aware of is that back-testing results are mostly
inaccurate. You are unlikely to get a good estimate of future results except perhaps by
luck. You cannot accurately estimate the actual fees, spreads, slippage, and other
unexpected events that will occur during live execution. Therefore, when including a
proxy of these costs in your back-tests, it is always helpful to bias them upwards.
For instance, let us assume that the average historical bid-ask spread on the USDCAD
pair given by your broker is 0.6 pips. The best thing to do is to suppose that the actual
spread is at least the historical average plus a margin for all the unexpected costs. This
is a conservative choice, because bid-ask spreads have tended to become more competitive
(i.e. smaller) over time, which is a positive thing for market efficiency.
A more detailed example in the following table:

The best argument for biasing the costs upwards is to escape from unpleasant surprises
during live trading as well as to test the robustness of the model when encountering a
volatile environment.
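The upward biasing of costs can be sketched as follows. Only the 0.6 pip average spread comes from the text. The 0.4 pip margin, the 5 pip gross gain, and the assumption that the spread is paid once per round-trip trade are hypothetical choices made for this example.

```python
def conservative_spread(avg_spread_pips, margin_pips):
    """Historical average spread plus a safety margin, in pips."""
    return avg_spread_pips + margin_pips


def net_pips(gross_pips, spread_pips):
    """Result of one round-trip trade after paying the spread once (assumption)."""
    return gross_pips - spread_pips


spread = conservative_spread(avg_spread_pips=0.6, margin_pips=0.4)  # margin is hypothetical
trade_result = net_pips(gross_pips=5.0, spread_pips=spread)         # gross gain is hypothetical
```

The smaller the average gross gain per trade, the larger the share of it that the biased spread consumes, which is why short-term models are hit hardest.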
The disadvantage of doing so is that many short-term models will get filtered out. For
instance, models that run on M5 and M15 time frames are more sensitive to costs than
models that run on hourly time frames, so cost management is imperative for such
models to provide consistent results.
However, if your model depends on estimating expected transaction costs as accurately
as possible, then it is helpful to know that they have been shown to be non-linearly
related to certain variables such as actual (realized) volatility. A simpler approach
would be to run a regression using past variables to explain the historical costs and to
assume that the relationship will hold over a short horizon. In practice, most back-tests
rely on only a few such variables, and the transaction cost assumption remains somewhat
arbitrary. The conclusion of this point is that you should never run a back-test without
incorporating transaction costs. In the articles I publish, I always include bid-ask
spreads even when I do not mention them.
Look-Ahead Bias
This bias is well known in the field of back-testing and research, and although it is
considered more of a rookie mistake, it must never be forgotten. Look-ahead bias occurs
when you use information that would not have been available at the time of the
prediction. Consider a strategy that relies on the daily closing values of the S&P500 to
make predictions. When designing the model, you erroneously use today’s close to predict
today’s close; hence, you are using future information to try to predict the future, which
is of course unfeasible in live trading.
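One simple guard against the mistake described above is to build the feature and the target explicitly so that the feature always comes from an earlier bar. The sketch below is an illustration under that assumption; the function name and the closing values are made up and are not real S&P500 data.

```python
def lagged_pairs(closes, lag=1):
    """Pair each target close with the close that was known `lag` bars earlier."""
    if lag < 1:
        raise ValueError("lag must be at least 1 to avoid look-ahead bias")
    return [(closes[t - lag], closes[t]) for t in range(lag, len(closes))]


closes = [100.0, 101.0, 99.5, 102.0]  # illustrative values only
pairs = lagged_pairs(closes)          # (feature, target) tuples
```

Rejecting a lag of zero makes the "today's close predicts today's close" error impossible to introduce by accident.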
The table below shows quarterly GDP data for the United Kingdom in 2018.
Notice that the third quarter GDP, which refers to the period between 01/07/2018 and
30/09/2018, is published on 09/11/2018, hence a lag of more than a month. Any researcher
should be careful not to align these figures with other figures that actually do get
released on 30/09/2018. Using data on a date on which it would not yet have been available
is referred to as look-ahead bias. Failing to make the appropriate adjustments will make
the whole analysis erroneous, unrealistic, and impossible to reproduce. Economic data
particularly suffers from this problem, and adjustments should be made to take the lag
into account.
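A minimal way to respect the publication lag is to index each figure by its release date rather than by the period it describes. The sketch below assumes a hypothetical list of (period end, release date) pairs. Only the Q3 2018 dates are taken from the text, and no GDP values are needed to show the idea.

```python
from datetime import date


def latest_available(releases, as_of):
    """Return the most recent period whose figure was already published on `as_of`.

    releases: list of (period_end, release_date) tuples (hypothetical structure).
    """
    known = [r for r in releases if r[1] <= as_of]
    return max(known, key=lambda r: r[0])[0] if known else None


releases = [(date(2018, 9, 30), date(2018, 11, 9))]  # Q3 2018, dates from the text
on_quarter_end = latest_available(releases, as_of=date(2018, 9, 30))
on_release_day = latest_available(releases, as_of=date(2018, 11, 9))
```

On 30/09/2018 the function returns nothing for Q3, because the figure did not exist yet. It only becomes usable from 09/11/2018 onwards.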
With some governmental data, revisions and updates can be made after the initial release,
thus introducing errors into the model. For instance, the above GDP table went through
many revisions before finally settling on the above values. Data revisions also mean that
the correct information might arrive even later than expected. After correcting for
look-ahead bias, we may find ourselves facing a newly revised value that could alter
predictions and tamper with the training of the model. A possible and naïve way to
correct for this issue is to use only the preliminary releases and disregard any updates.
As the updates will only be known at a later time, it is useless to incorporate them
into the model. We deal with what we have now.
Not Accounting for Market Regime Changes

Most of the time, markets are either trending or ranging, and while we can develop
strategies for both market regimes, it is difficult to find one strategy that is able to
detect the change in regime and adapt itself while continuing to be profitable.
Such a Strategy-of-Everything is unlikely to exist at the moment, as financial time series
are highly complex and dynamic.
When performing a back-test, you have to define what type of strategy you have and on
which regimes (states) you will be testing it. Let us look at an example. Consider a
strategy that goes long (initiates a buy order) each time we have three consecutive lower
lows and goes short (initiates a sell order) each time we have three consecutive higher
highs. It is clear that this is a contrarian strategy better suited to a range.
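The contrarian rule above can be sketched as a signal function. This is only one possible reading of the rule: here "three consecutive lower lows" means that each of the last three bars has a low below the previous bar's low, and likewise for higher highs. The function name and the bars are hypothetical.

```python
def contrarian_signals(highs, lows, n=3):
    """Return one signal per bar: 1 (long), -1 (short) or 0 (no trade).

    The signal is known only once bar i has closed, so a back-test should
    act on it from the following bar to avoid look-ahead bias.
    """
    signals = [0] * len(lows)
    for i in range(n, len(lows)):
        window = range(i - n + 1, i + 1)
        if all(lows[j] < lows[j - 1] for j in window):
            signals[i] = 1    # n consecutive lower lows: buy
        elif all(highs[j] > highs[j - 1] for j in window):
            signals[i] = -1   # n consecutive higher highs: sell
    return signals


highs = [10.0, 9.8, 9.6, 9.4, 9.7, 9.9, 10.2]  # illustrative bars only
lows = [9.5, 9.3, 9.1, 8.9, 9.2, 9.4, 9.6]
signals = contrarian_signals(highs, lows)
```

In a persistent trend the sell condition keeps firing against the move, which is exactly why this type of rule belongs in a ranging market.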
Now, if we apply it on a trending market such as the S&P500, what would be the result? I
will give you a hint: Bad.
The S&P500 Index. A Mostly Trending Market.
The answer is clear, trending markets require trend-following strategies. But what about
markets that alternate between ranging and trending? In that case, we need other tools
to approximate the current state and use a proper strategy. This can get complicated very
fast. Here is an example of a trend-following strategy I like to use from time to time:
An exotic technical trend-following strategy in FX.

A personal trading strategy I like to use to follow the trend.
medium.com
Non-Stationary Data
Stationarity means that the statistical properties of a series stay roughly constant over
time. A time series is stationary if it has a constant mean and a constant variance
(volatility), that is, neither changes much over time. A changing mean will cause the
model to produce erroneous forecasts.
In other, more technical terms, stationarity is when prices diffuse at a slower rate
than a geometric random walk. Financial data have too much noise, and differencing
or taking log returns will make them almost stationary at the cost of losing their
memory, but that is the best we have got right now at the very basic level.
Note that we are talking about feeding our machine learning models with inputs to produce
a forecast.
A stationary data series with a mean ≈ 0.09. (Image by Author)
As time series of prices exhibit significant autocorrelation over short intervals, it
is rational to expect that a model fed with raw prices will appear to deliver very good
results with little effort. When you have a machine learning algorithm that seems to have
predictive power, you must test it on stationary data; otherwise, the results will be
misleading.
As time series (prices) are significantly autocorrelated, the model will always follow the
latest value, and therefore, it will likely reproduce the previous value and call it a
forecast. When time series are transformed into stationary data, either by differencing or
by taking returns, this problem more often than not goes away.
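The two transformations mentioned above, differencing and taking log returns, can be sketched as follows. The function names and the three prices are illustrative assumptions.

```python
import math


def differences(prices):
    """First differences: price[t] - price[t-1]."""
    return [b - a for a, b in zip(prices, prices[1:])]


def log_returns(prices):
    """Log returns: ln(price[t] / price[t-1])."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


prices = [100.0, 110.0, 99.0]  # illustrative values only
diffs = differences(prices)
rets = log_returns(prices)
```

Both outputs are one element shorter than the input, and log returns add up over time: their sum equals the log of the last price divided by the first.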
Let us see what happens when we use a normal auto-regression technique to forecast
Bitcoin. The first test will use non-stationary (i.e. BTCUSD prices) data and the second
test will use stationary data (i.e. BTCUSD returns).
Test #1: Non-Stationary Data
In layman’s terms, we will apply a machine learning model to actual prices (and not
returns) of Bitcoin relative to the US dollar (BTCUSD pair) and evaluate our predictions
on the out-of-sample dataset. Below are the rules and results of the predictions:
Rules
Test asset: BTCUSD
Model used: Linear regression.
Training days: 2221.
Testing days: 100.
Results
R-square: 0.98
Interpretation
The R-square means that our model explains 98% of the variation in the actual values,
which corresponds to a correlation of about 0.99 between the predicted values and the
actual values. A utopian model like this does not exist in such a complex world, and
whoever is capable of building one would be a billionaire in less than a week.
Something must be wrong here, and there is. First, we take a look at the below graph
showing the high linear correlation between the predicted and real values using a simple
linear regression model that uses past prices as explanatory variables to guide it with
finding future prices.
R-square between the predicted and actual values shows a superior modelling power of our algorithm. (Image by
author)
However, if we plot the actual and predicted values on a line chart to better visualize the
correlation and see how well our model is doing, we see something strange. Indeed, it
seems that our model is following the actual values with a lag of one. If today’s values go
up, our forecast for tomorrow is also up, and vice versa. It appears that we are not really
doing anything but repeating yesterday’s news. Not only does this model have no
predictive power, but over time the transaction costs will also eat up any random
profits that may come with luck.
Actual values vs predicted values. The model seems to be simply replicating the value of yesterday. (Image by
author)
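The lag-of-one effect can be reproduced without any market data. The sketch below is not the author's experiment: it simulates a random walk (an assumption, not real BTCUSD prices), uses the previous value as the "forecast", and measures the squared correlation between forecast and actual values, first on prices and then on price changes.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)   # fixed seed so the run is repeatable
prices = [1000.0]
for _ in range(2320):     # simulated random walk, NOT real BTCUSD data
    prices.append(prices[-1] + rng.gauss(0.0, 1.0))

# "Forecast" = previous value, i.e. a model that just repeats yesterday.
r2_prices = pearson(prices[:-1], prices[1:]) ** 2

# Same naive forecast applied to the differenced (stationary) series.
changes = [b - a for a, b in zip(prices, prices[1:])]
r2_changes = pearson(changes[:-1], changes[1:]) ** 2
```

On prices, the naive forecast should score close to 1 even though the series is unpredictable by construction. On the differenced series, the same forecast should score close to 0, which is the honest answer.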
Test #2: Stationary Data
The models we use assume that the time series is stationary, and it is only under that
assumption that they provide a real forecast, that is, the model is actually doing
something useful. Evidently, most machine learning models should be used to predict the
differences between prices (in the case of asset predictions). Let us now repeat the above
experiment with the exact same rules, only this time using returns data (differenced data
can also be used). The lagged period is also the same.
The results show that the model now does slightly worse than a random guess, which might
suggest that the dataset itself behaves like a random walk. Are the fluctuations of
BTCUSD random? It takes more analysis to answer that question, but for now we can
safely say that our algorithm with its current parameters cannot forecast the direction of
the asset. This is not surprising, because a simple linear model is unlikely to predict a
highly complex market.
If you are also interested by more technical indicators and using Python to create
strategies, then my best-selling book on Technical Indicators may interest you:
New Technical Indicators in Python

Amazon.com: New Technical Indicators in Python: 9798711128861:
Kaabar, Mr Sofien: Books
www.amazon.com
Rules
Test asset: BTCUSD returns.
Model used: Linear regression.
Training days: 2221.
Testing days: 100.
Results
R-square: 0.03
Accuracy: 49%
The plot of predicted vs actual values shows no correlation between the two whatsoever, giving a stronger
conviction of the underperformance of our model. (Image by author)
The returns of BTCUSD were much more volatile than predicted with the linear regression model. The model has
been wrong about 51% of the time. (Image by author)
Another measure worth mentioning in the case of a linear model is the R-square. This
goodness-of-fit measure is very common in econometrics. It is the percentage of the
variation in the dependent variable that is explained by the independent variable(s).
Before we introduce the formula (which is very simple), we must mention two calculations.
The SSE (sum of squared errors) is the part left unexplained by the model, and the SST
(total sum of squares) is the unexplained plus the explained part. Intuitively, from the
formula below we can see that the R-square measures the percentage explained by the
model.
R-square = 1 - (SSE / SST)
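For readers who prefer code, the standard definition R-square = 1 - SSE / SST can be computed directly. The function name and the four data points are illustrative assumptions, chosen so that the arithmetic is easy to follow by hand.

```python
def r_squared(actual, predicted):
    """R-square = 1 - SSE / SST for two equally long sequences."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained part
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1.0 - sse / sst


actual = [1.0, 2.0, 3.0, 4.0]              # illustrative values only
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(actual, predicted)          # SSE = 0.10 and SST = 5.00
r2_mean = r_squared(actual, [2.5] * 4)     # always predicting the mean
```

Here the R-square is about 0.98. A model that always predicts the mean of the actual values gets an R-square of 0, which is the natural baseline.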
Focusing too Much on the Hit Ratio

Sure, an 80% hit ratio on your trades is great. But what if you are risking $1 each time to
earn $0.20 (20 cents)? Well, then you will lose money and eventually get wiped out, because
your risk-reward ratio will be 0.2. Suppose you make 100 trades, always with the same
position sizing, and you get your 80% hit ratio, which translates to 80 profitable trades
each gaining $0.20.
This gives you a profit of 80 x 0.2 = $16. Alright, not bad, but let us look at the
remaining 20 losing trades, which have lost $1 each. This gives you a loss of 20 x 1.0 =
$20. Your net profit is therefore -$4.00. Hence, by getting it right 8 out of 10 times, you
have managed to lose money. How to fix this?
Risk-reward Trade-off and the Hit Ratio. At a risk-reward Ratio of 1.00, we need 50% to breakeven. (Image by
author)
We should expect at least $1.8 for every $1.0 we are risking. This gives us some room to
maneuver. With a risk-reward ratio of 1.8, we only need a hit ratio of 35.71% to
break even. Thus, consider evaluating a strategy that had a 40% hit ratio with a
risk-reward ratio of 1.82, using the same position size on every trade. Per trade, the
expected result is:
Expected profit = 0.4 x 1.82 = +$0.728
Expected loss = 0.6 x 1.00 = -$0.600
Net expected profit = 0.728 - 0.600 = +$0.128
To compute the required hit ratio to break even, you can use the following formula:
Break-even hit ratio = 1 / (1 + risk-reward ratio)
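As a rough check of the arithmetic in this section, the sketch below computes the break-even hit ratio as 1 / (1 + risk-reward ratio) and the expected profit per trade. The function names are assumptions made for this example, and every trade is assumed to risk the same amount.

```python
def break_even_hit_ratio(risk_reward):
    """Hit ratio needed to break even when winners pay `risk_reward` per 1 risked."""
    return 1.0 / (1.0 + risk_reward)


def expectancy(hit_ratio, risk_reward, risk=1.0):
    """Expected profit per trade for a fixed amount risked on each trade."""
    return hit_ratio * risk_reward * risk - (1.0 - hit_ratio) * risk


be_equal = break_even_hit_ratio(1.0)   # risk-reward ratio of 1.00
be_18 = break_even_hit_ratio(1.8)      # risk-reward ratio of 1.8
edge_bad = expectancy(0.80, 0.20)      # the 80% hit ratio example
edge_good = expectancy(0.40, 1.82)     # the 40% hit ratio example
```

The 80% hit ratio strategy loses about $0.04 per trade, or $4 over 100 trades, while the 40% hit ratio strategy earns about $0.128 per trade.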
Not Taking Into Account Yearly Performances

Look at the equity curve below and tell me what you see. Clearly, it is upward sloping
and looks attractive. After all, you have started with $1,000 and now have
around $4,500. Now, let us zoom in.
An example of a Strategy that has produced positive cumulative returns. (Image by author)
Although it does look good, when we take the years one by one, we find some losing
years that are much less attractive and could actually wipe us out if we started trading
this strategy at the wrong time. This raises the question: if we were truly trading this
strategy and had a bad year, would we continue? Unfortunately, we will never
know, and that is why we need a strategy that wins most of the time (years) and not one
that greatly outperforms in a few years but spends most years losing money or being
flat.
We should look at the evolution of the strategy and not just stick to basic performance
statistics. The above strategy had a 61.67% hit ratio but still manages to be somewhat
bad.
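Breaking an equity curve down by calendar year takes only a few lines. The sketch below assumes a hypothetical list of (date, equity) points sorted by date. The values are invented for illustration and are not the strategy shown above.

```python
from datetime import date


def yearly_returns(equity_curve):
    """Map each calendar year to its return, given sorted (date, equity) points."""
    last_by_year = {}
    for day, equity in equity_curve:
        last_by_year[day.year] = equity       # keeps the final equity seen in each year
    returns = {}
    start_equity = equity_curve[0][1]
    for year in sorted(last_by_year):
        end_equity = last_by_year[year]
        returns[year] = end_equity / start_equity - 1.0
        start_equity = end_equity             # next year starts where this one ended
    return returns


curve = [  # hypothetical equity points
    (date(2018, 1, 1), 1000.0),
    (date(2018, 12, 31), 1200.0),
    (date(2019, 12, 31), 1080.0),
    (date(2020, 12, 31), 1620.0),
]
by_year = yearly_returns(curve)
losing_years = [year for year, r in by_year.items() if r < 0]
```

The cumulative result is a gain of 62%, yet a trader who started at the beginning of 2019 would have spent the first year losing 10%.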
Not Taking Into Account Risk Management Before Going Live

When you back-test a strategy, you must account for stop-loss and take-profit orders. In
other words, since you will place stops and profit targets when you apply the strategy in
real life, the back-test should include them as well. You should know that these orders
change the performance drastically. Here is the RSI strategy with and without fixed
stops.
EURUSD Hourly with 14-Hour RSI. (Image by author)
Following a “buy when the RSI(14) touches 20 and sell when the RSI touches 80”
strategy, we get the following results for both tests:
Comparison between the two RSI Strategies. (Image by author)
Notice the huge difference between the final results? Even though they are both
negative, the one without risk management did much worse. We do not want that to
happen when we switch from virtual to real time trading.
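To show how fixed orders change the outcome of a single trade, here is a minimal sketch for a long position. It is not the RSI back-test above. The function name, the bars, and the price levels are hypothetical, and when a bar touches both levels the stop is assumed to be hit first, which is a deliberately pessimistic choice.

```python
def exit_long(bars, stop, target):
    """Return (exit_price, bars_held) for a long trade with a fixed stop and target.

    bars: list of (high, low, close) tuples that follow the entry.
    """
    for i, (high, low, close) in enumerate(bars, start=1):
        if low <= stop:
            return stop, i      # stop-loss checked first (pessimistic assumption)
        if high >= target:
            return target, i
    return bars[-1][2], len(bars)  # neither level hit: exit at the last close


bars = [  # hypothetical hourly bars after a long entry at 1.1000
    (1.1020, 1.0980, 1.1000),
    (1.1010, 1.0940, 1.0960),
    (1.0970, 1.0850, 1.0860),
]
with_stop = exit_long(bars, stop=1.0950, target=1.1100)
without_stop = exit_long(bars, stop=float("-inf"), target=float("inf"))
```

With the stop, the trade is closed at 1.0950 on the second bar. Without it, the position is carried down to 1.0860, a loss almost three times larger.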
Conclusion
There will always be some form of bias in a back-test. Our job as researchers and
traders is to minimize it so as to maximize the probability that the back-tested results
are realized. We are all of course familiar with the sayings that history does not repeat
itself and that past performance is no guarantee of future profits, but the past is the
best we have in our fight against the future. If you manage to incorporate at least some
of the above points, then you are likely on the right track. Remember, finding your
strategy is not an overnight process, so be patient.
Image by Nattanan Kanchanaprat from Pixabay