
How “backtest overfitting” in finance leads to false discoveries

Financial investment strategies are often designed and tested using historical market data. But this can frequently give rise to “optimal” strategies that are statistical mirages and perform poorly out in the real world, as David H. Bailey and Marcos López de Prado explain.

A common goal of investment funds is to deliver a higher percentage return than the overall market without incurring a greater probability of a financial loss. To devise investment strategies to achieve this goal, firms and analysts typically feed historical market data into computer programs that test a multitude of combinations of financial instruments, weighting factors, decision points and other parameters, all to identify an “optimal” design. With this “optimal” design in hand, they tout the potential return that an investment based on this design is likely to deliver, based on its simulated performance on historical data – a process known as backtesting. However, in all too many cases, such investments deliver only disappointing performance when fielded.1 The “optimal” design turns out to be a false discovery.

Three features of financial research make this field particularly prone to false discoveries. First, the probability of finding a profitable investment strategy is very low, due to intense competition. Second, true findings are mostly short-lived, as a result of the rapidly changing nature of financial systems. Third, unlike in the natural sciences, it is rarely possible to verify statistical findings through controlled experiments. In the absence of controlled experiments, it is virtually impossible to debunk a false claim. One would hope that, in such circumstances, researchers would be particularly careful when conducting statistical inference. Sadly, the opposite is true.

A leading reason for the failure of investment models is backtest overfitting (see the Glossary, page 25, for a definition of this and other italicised terms). This occurs when historical market data is used to develop an investment model, fund or strategy, but too many model variations are tried relative to the amount of data available. It is a form of selection bias under multiple testing. Models, funds and strategies suffering from this type of statistical overfitting typically target the random patterns present in the limited in-sample test data on which they are based, and thus often perform erratically when presented with new, truly out-of-sample data. The sobering consequence is that a substantial portion of the models, funds and strategies employed in the investment world may be merely statistical mirages.

Designing investment strategies by computer search

The potential for backtest overfitting in the financial field has grown enormously in recent years with the increased use of computer programs to search a space of millions or even billions of parameter variations for a given model, fund or strategy. Even very simple investment strategies typically have numerous parameters and choices. Suppose that an investor believes that there may be monthly patterns in certain sets of stocks that may lead to a profitable strategy, say by purchasing shares on a fixed day of the month, and selling on another fixed date. There are many variations for such a strategy, as illustrated in Figure 1.

Note that even with this simple investment strategy (which, by the way, is very unlikely to produce reliable market-beating profits), there are 435 choices just for the start and end dates of each monthly investment cycle. Admittedly, not all of these choices count as independent trials, but each additional choice raises the probability of a fluke. In any event, it is clear that designing such a strategy by searching via computer over the space of all parameter combinations, in order to design an “optimal” strategy, is virtually certain to produce an overfitted backtest, unless one explicitly guards against it using rigorous statistical tools and a solid economic rationale.2

Overfitting in the design of stock funds

Consider the problem of designing an investment fund to meet some desired performance profile. One increasingly popular investment product is the exchange-traded fund (ETF), namely a mutual fund that may be freely traded during the day like an

22 SIGNIFICANCE December 2021 © 2021 The Royal Statistical Society


17409713, 2021, 6, Downloaded from https://rss.onlinelibrary.wiley.com/doi/10.1111/1740-9713.01588 by Eth Zürich Eth-Bibliothek, Wiley Online Library on [23/04/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
individual stock or bond. In the USA alone, there is currently over $5 trillion invested in ETFs. Hundreds of new ETFs are minted each year, many of them following some custom-designed index (i.e., a custom-designed set of stocks and weights). In a 2012 study, researchers found that the median time between the definition of a new index and the inception of a new ETF based on that index dropped from almost 3 years in 2000 to only 77 days in 2011. As a result, “most indexes have little live performance history for investors to assess in the context of a new ETF investment” (bit.ly/3BF1hU9).

Figure 1: Illustration of an investment strategy built through trial and error.

How do these newly minted index ETFs perform? A 2015 study computed the performance of all ETFs that were launched in the US market from 1993 to 2014. Researchers found that the investment strategies underlying those ETFs delivered average annual excess returns of approximately 5% prior to their launch (i.e., in backtests). This strong performance contrasts with average annual excess returns of approximately 0% when used out-of-sample (see Figure 2).1 Such disappointing behaviour is entirely consistent with a design process that involves extensive computer exploration of index parameters and selecting only the “optimal” parameters for an index fund subsequently fielded in the financial markets.

Figure 2: Backtested performance against performance out-of-sample. Appears as Figure 3 in Brightman, Li and Liu.1
[Chart: three-year cumulative relative index performance before and after ETF launch. Horizontal axis: number of months since ETF launch (t), from −36 to 36; vertical axis: cumulative value of excess return, from 1 to 1.4. Source: Research Affiliates, LLC, using data from Bloomberg.]

Mutual funds, forecasters and anomalies

In the past few years it has become clear to many individual investors that few mutual funds or other financial investments can consistently generate gains above the overall market averages. For example, a 2019 report found that among actively managed funds (i.e., funds whose stocks are actively selected, bought and sold by experts at a financial firm) in the “US large value” category, only 8.3% beat the comparable passive index fund (i.e., a fund where no attempt is made to manage, except to follow a relevant broad-market index) over a 10-year period. Among “world stock” actively managed funds, only 26.3% beat the comparable passive index fund over a 10-year period (bit.ly/3eS54nt). In other words, very few actively managed funds have beaten the overall market averages over the long haul.

The issue of selection bias reaches far beyond the realm of quantitative investing. Prominent market forecasters often promote in the media their success at predicting some events, while hoping that the audience has forgotten an equal or greater number of false calls. In 2016, Nir Kaissar analysed a set of predictions by professional market forecasters over a 17-year period from 1999 to 2016 (bloom.bg/2WljmGO). He found that although there was a reasonably high correlation between the average forecast and the year-end price of the S&P 500 index for the given year, these predictions were surprisingly unreliable during major shifts in the market. For example, Kaissar found that the strategists overestimated the S&P 500’s year-end price by 26.2% on average during the three recession years 2000–2002,

December 2021 significancemagazine.com 23


David H. Bailey recently retired from the Lawrence Berkeley National
Laboratory, where he was a senior scientist. He is a research associate at the
University of California, Davis, and has published over 300 research articles
in the fields of high-performance scientific computing, computational
mathematics and financial mathematics.

Evaluating investments

Investments are typically evaluated by the Sharpe ratio, a metric of the performance of an asset relative to its volatility (“riskiness”).7 It is calculated by dividing the expected excess returns relative to a risk-free asset, like a US Treasury bond, by the standard deviation of the returns. To make Sharpe ratios comparable across investments with different sampling frequency, the ratio is often “annualised”, by multiplying it by the square root of the number of observations in a year. However, annualised Sharpe ratios should not be thought of as t-values for testing the significance of the sample mean, since they do not take into account the number of observations. To correct for this problem, the present authors proposed the probabilistic Sharpe ratio,8 which allows one to test the significance of the Sharpe ratio under general conditions of stationarity and ergodicity.

Another useful tool is the false strategy theorem.9 An investment analyst may carry out a large number of simulation trials on historical data, and report only the model, fund or strategy with the maximum Sharpe ratio. But the distribution of the maximum Sharpe ratio is clearly not the same as the distribution of a Sharpe ratio randomly chosen among the trials. Instead, the expected value of the maximum Sharpe ratio is greater than the expected value of the Sharpe ratio from a random trial. In particular, given an investment strategy with expected Sharpe ratio zero and non-zero variance, the expected value of the maximum Sharpe ratio steadily increases, up from zero, as a function of the number of trials. One can thus deduce an expected maximum Sharpe ratio, namely the hurdle or threshold that the reported Sharpe ratio must exceed before it can be considered a significant finding. This result is known as the false strategy theorem: given a sample of estimated performance statistics $\{S_k\}$, $k = 1, \ldots, K$, each independently following a zero-mean, unit-variance Gaussian distribution, we have

$$\mathrm{E}\left[\max_k \{S_k\}\right] \approx (1 - \gamma)\, Z^{-1}\left[1 - \frac{1}{K}\right] + \gamma\, Z^{-1}\left[1 - \frac{1}{Ke}\right]$$

where $\mathrm{E}[\cdot]$ denotes expected value, $Z^{-1}[\cdot]$ denotes the inverse of the standard Gaussian cumulative distribution function, $e$ is Euler’s number (2.718281828…, the base of natural logarithms), and $\gamma$ is the Euler–Mascheroni constant (approx. 0.5772156649…).

In practical terms, the false strategy theorem tells us that the optimal outcome of an unknown number of historical market data simulations is right-unbounded. In other words, with enough trials, there is no Sharpe ratio threshold sufficiently large to reject the hypothesis that a strategy is false. The rule of thumb of halving the backtest’s Sharpe ratio, popular among many investment professionals, has no scientific basis. Again, given the ease with which one can use a computer to explore literally thousands, millions, or even billions of variations of a given strategy and select only the “optimal” variation, it follows that it is very easy to find impressive-looking strategy variations that are nothing more than false positives.

The present authors combined the ideas behind the probabilistic Sharpe ratio and the false strategy theorem to derive a formula for deflating the Sharpe ratio.10 The deflated Sharpe ratio is the probability that an observed Sharpe ratio was drawn from a distribution with positive mean, after controlling for sample length, skewness, kurtosis, and the number of strategy variations explored. Let us suppose that a researcher is constructing a financial model or strategy based on the daily closing values of the FTSE 100 index. An observed annualised Sharpe ratio of 1, where the backtest length is 10 years of daily returns, may appear to be strong evidence of a true discovery. However, if the researcher conducted three or more independent trials, our confidence that the finding is statistically significant is below the standard 95% cutoff. Figure 3 shows the deflated Sharpe ratios for strategies with observed annualised Sharpe ratios of 0.5, 1, and 1.5, as a function of the number of trials. In practice, investment strategies’ returns often exhibit positive autocorrelation, negative skewness, and fat tails, which further depress the deflated Sharpe ratio. The implication is that, in most cases, as few as three independent trials suffice to produce an investment strategy that is likely false.

Figure 3: Deflated Sharpe ratios as a function of the number of trials, based on backtests of 10 years of independent and identically distributed normal daily returns.
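The hurdle described in the box can be sketched in a few lines of Python. What follows is a minimal illustration, not the authors' implementation: the function `expected_max_sharpe` evaluates the false strategy theorem approximation stated in the box, while `deflated_sharpe_ratio` uses one common formulation of the deflated Sharpe ratio (a probabilistic Sharpe ratio tested against the scaled hurdle). The function names, the `trials_std` parameter (the standard deviation of Sharpe ratios across trials, which the box does not specify) and the example value of 0.5 are all assumptions made for illustration, so the printed numbers need not match Figure 3.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649   # Euler–Mascheroni constant, as quoted in the box
_STD_NORMAL = NormalDist()   # standard Gaussian: zero mean, unit variance


def expected_max_sharpe(num_trials: int) -> float:
    """Approximate E[max_k S_k] for K independent standard-normal statistics."""
    if num_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    k = num_trials
    return ((1 - EULER_GAMMA) * _STD_NORMAL.inv_cdf(1 - 1 / k)
            + EULER_GAMMA * _STD_NORMAL.inv_cdf(1 - 1 / (k * math.e)))


def deflated_sharpe_ratio(sharpe: float, num_obs: int, num_trials: int,
                          trials_std: float, skew: float = 0.0,
                          kurt: float = 3.0) -> float:
    """Probability that the true Sharpe ratio exceeds the multiple-testing hurdle.

    `sharpe` and `trials_std` are per-observation (not annualised) values.
    `trials_std` is an assumed input: the spread of Sharpe ratios across trials.
    """
    hurdle = trials_std * expected_max_sharpe(num_trials)
    # Standard-error term adjusted for skewness and kurtosis of the returns.
    denom = math.sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    z = (sharpe - hurdle) * math.sqrt(num_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


if __name__ == "__main__":
    periods = 252                          # assumed trading days per year
    daily_sharpe = 1.0 / math.sqrt(periods)   # annualised Sharpe ratio of 1
    daily_std = 0.5 / math.sqrt(periods)      # hypothetical cross-trial spread
    for trials in (2, 3, 10, 100, 1000):
        dsr = deflated_sharpe_ratio(daily_sharpe, 10 * periods, trials, daily_std)
        print(f"K={trials:5d}  hurdle={expected_max_sharpe(trials):.3f}  DSR={dsr:.3f}")
```

Running the script prints the hurdle and the deflated Sharpe ratio for a growing number of trials. Whatever spread is assumed, the hurdle rises with the number of trials and the deflated Sharpe ratio falls, which is the qualitative point of the box. Passing negative `skew` or `kurt` above 3 lowers the result further, in line with the remark about negative skewness and fat tails.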

yet they underestimated the index level by 10.6% for the initial recovery year, 2003. A similar phenomenon was seen in 2008, when strategists overestimated the S&P 500’s year-end level by a whopping 64.3%, but then underestimated the index by 10.9% for the first half of 2009. In other words, as Kaissar lamented, “the forecasts were least useful when they mattered most”.

In 2018, the present authors published, with a colleague, an in-depth analysis of 68 market forecasters, including many who employ technical analysis, a relatively unsophisticated form of historical data analysis.3 Expanding on an earlier study, we analysed forecasts based on two key factors: the time-frame of the forecast and the importance and specificity of the forecast. Our study found that the

Marcos López de Prado is professor of practice at
Cornell University’s School of Engineering, in New York,
NY, USA, and global head of quantitative research and
development at the Abu Dhabi Investment Authority,
the sovereign wealth fund of Abu Dhabi, UAE.

average accuracy score of these forecasts was 48%, not substantively different than chance. Although a handful of forecasters did well, there was no evidence of overall forecasting skill in the set studied.

In another recent study of the statistical reliability of anomaly indicators in finance (signals in financial market data that may indicate an investment opportunity), the authors soberly concluded that they were not able to statistically replicate most of the findings that had been reported in a large set of published papers in the academic finance field.4 Of the 452 anomaly indicators studied, 65% did not even clear the single test threshold of t = 1.96 or greater, when correctly analysed. With a more stringent criterion that partially compensates for multiple testing, namely t = 2.78 at the 5% significance level, the failure rate increases to 82%.

Why the poor performance in these studies? In some cases, the discovered phenomenon may fade away following its publication, as a result of intense competition among investors.5 However, a more likely explanation is that it was a false discovery to begin with: a result of widespread selection bias under multiple testing.

A pervasive problem

Backtest overfitting can be thought of as the financial field’s variation of p-hacking, namely the deplorable practice, conscious or not, of publishing results of a study based on a subset of the actual data or trials performed, in order to exhibit some desired level of statistical significance.6 In order to control for this effect in, say, the biomedical field, leading journals and regulatory bodies increasingly require researchers to report the results from all trial data, so that the likelihood of false positives can be discounted from the reported results. However, many in the field of finance do not realise that the very act of performing a computer search for an “optimal” design almost certainly renders the results statistically overfitted. Furthermore, textbooks tend to ignore or downplay the challenges posed by multiple testing, and most academic finance journals fail to require authors to declare the full extent of computer trials involved in a discovery, even though the authors may well have performed an extensive computer search for optimal parameters. After a researcher has found a statistical pattern, they can easily build a theoretical explanation around it to rationalise what in reality is nothing more than data snooping.

Some in the finance field have questioned the existence of a replication crisis. They have argued that concerns with backtest overfitting are overblown, or that certain investment styles (e.g., “factor investing”) are not as susceptible as others to overfitting. We do not agree. Rather, the preponderance of poor out-of-sample performance points to a pervasive problem in the field. As Campbell R. Harvey, past president of the American Finance Association, lamented in his 2017 presidential address, “our standard testing methods are often ill equipped to answer the questions that we pose”.6

One of the most pressing of questions is, “Which investment strategy will deliver a market-beating return?” That is a hard question to answer. What we can say is that trawling again and again through historical market data in a bid to identify an “optimal” approach will often lead to a dead end. Without keeping an eye on the problem of backtest overfitting in finance, investors can end up chasing statistical illusions in their pursuit of profit.

Disclosure statement

The authors declare no conflicts of interest.

Glossary

■ Anomaly indicator. A signal in financial market data that may indicate a notable change in direction or an investment opportunity.
■ Backtest overfitting. The use of historical market data to develop an investment model, fund or strategy, where too many variations are tried relative to the amount of data available.
■ Exchange-traded fund (ETF). A mutual fund whose shares may be freely traded during the trading day like shares of an individual stock or bond.
■ Index. A set of stocks or bonds, together with corresponding weights, typically defined in an objective way by some fixed definition or governing committee; for example, the S&P 500 (US stocks) and the FTSE 100 (European stocks).
■ Index fund or passive index fund. A mutual fund tied to a defined index, where no attempt is made to manage the fund except to follow the index as closely as possible.
■ In-sample data. Historical data used as input to the design of a model, fund or strategy.
■ Mutual fund. An investment fund, typically consisting of a certain set of stocks or bonds selected according to some strategy, index or risk level; may be “active” or “passive”.
■ Out-of-sample data. New input data used to test a model, strategy or fund.
■ Selection bias under multiple testing. Statistical bias that occurs when a researcher conducts multiple tests or analyses, but only reports the test with the best outcome.
■ Technical analysis. A relatively unsophisticated form of historical market data analysis, often involving charts and graphs, that typically ignores statistical problems such as overfitting.

References

1. Brightman, C., Li, F. and Liu, X. (2015) Chasing performance with ETFs. Fundamentals, Research Affiliates. bit.ly/3DCD3Ku
2. Lopez de Prado, M. and Lewis, M. (2019) Detection of false investment strategies using unsupervised learning methods. Quantitative Finance, 19(9), 1555–1565.
3. Bailey, D. H., Borwein, J. M. and Lopez de Prado, M. (2018) Evaluation and ranking of market forecasters. Journal of Investment Management, 16(2), 47–64.
4. Hou, K., Xue, C. and Zhang, L. (2020) Replicating anomalies. Review of Financial Studies, 33, 2019–2133.
5. McLean, R. D. and Pontiff, J. (2015) Does academic research destroy return predictability? Journal of Finance, 71(1), 5–32.
6. Harvey, C. R. (2017) Presidential address: The scientific outlook in financial economics. Duke I&E Research Paper No. 2017-05.
7. Sharpe, W. F. (1994) The Sharpe ratio. Journal of Portfolio Management, 21, 49–58.
8. Bailey, D. H. and Lopez de Prado, M. (2012) The Sharpe ratio efficient frontier. Journal of Risk, 15(2), 3–44.
9. Bailey, D. H., Borwein, J. M., Lopez de Prado, M. and Zhu, Q. J. (2014) Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society, May, 458–471.
10. Bailey, D. H. and Lopez de Prado, M. (2014) The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting and non-normality. Journal of Portfolio Management, 40(5), 94–107.
