

PRINCETON UNIVERSITY

Quantitative Data Analysis in Finance
Forecasting Daily Volatilities of Global Stock Indexes
April 30th, 2020

Sabharwal, Sidharth
Sachdeva, Aarsh
Sanchez-Escobar, Nicolas
Contents

1 INTRODUCTION
2 DATA
   2.1 Machine Learning Methods
   2.2 ARCH/GARCH Methods
3 MODEL BUILDING
   3.1 Penalized Regression Methods
   3.2 Ensemble-Tree Methods
   3.3 ARCH/GARCH Methods
   3.4 Benchmarks
4 MODEL EVALUATION
   4.1 Mean/Median Squared Error
   4.2 Gaussianity Assumption
   4.3 Coverage Probability
   4.4 Model Specification
   4.5 Feature Importances
5 PREDICTIONS
   5.1 Out of Sample Testing (Feb. 1 to Apr. 30)
   5.2 VaR
   5.3 Feature Importances
   5.4 True Out of Sample (May 1 to May 22)
6 FURTHER IMPROVEMENTS
7 CONCLUSION
8 APPENDIX
   8.1 List of Indexes
   8.2 List of VIX Equivalents
   8.3 Code
1 INTRODUCTION
The goal of this study is to use machine learning and statistical modeling meth-
ods to forecast daily volatilities of major global stock indexes after the COVID-19
outbreak. We also provide risk measures that can help portfolio managers make
investment decisions during this turbulent period. Given that the world is facing
major structural change with global spillovers and unpredictable economic and so-
cial consequences, we aim to provide new insights into how to face the high degree
of uncertainty caused by the COVID-19 crisis.
Using three penalized regression methods, two ensemble-tree methods, and two au-
toregressive methods, we fit models using a high-dimensional data set of realized
variances and economic variables from January 1st, 2000 to January 31st, 2020. We
analyze the predictive performance of each model over five different horizons ranging
from one day forward to one month forward. For each horizon we examine perfor-
mance over two regimes, high volatility and low volatility, in addition to overall
performance. We use mean and median squared errors and coverage probabilities to
evaluate our models.
We then provide daily, weekly, and monthly realized variance predictions for the
period of February 1st, 2020 to April 30th, 2020 and investigate how they compare
to true realized variances. We also provide value-at-risk estimates and evaluate them
using techniques laid out by Lopez (1998) at the Federal Reserve Bank of New York
[1]. Finally, we present true out of sample predictions for the period of May 1st to
May 22nd.

2 DATA
2.1 Machine Learning Methods
The initial data set consists of daily open prices, close prices, and realized variances of
31 global stock indexes from January 1st, 2000 to January 31st, 2020. We construct
a data frame of realized variances to use as covariates for our machine learning
models, including also rolling sums of one week and one month to reflect medium
and longer term historical variance. Although high variance tends to be correlated
with negative returns, we choose not to include daily returns or prices because this
information exists within the variance data itself.
In addition, the following covariates are included in the data set:
• 10-Year/2-Year Treasury Spread: The difference in yields of a 10-year
US treasury bond and a 2-year US treasury bond. This spread reflects economic uncertainty in the US and globally. When markets are more uncertain,
investors dive into safe haven assets, and longer-dated bonds become more
attractive. This causes the yield on these bonds to shrink and the spread
to narrow. Because this spread is generally negatively correlated with equity
returns and is also an indicator of future market fear, its inclusion partially
addresses the market’s asymmetric response to positive and negative news.
For this reason, the spread is informative to our data set.
• Brent Crude Oil Prices: Daily price of a barrel of North Sea Brent crude
oil. Brent crude prices are correlated with global political and economic uncer-
tainty and also serve as a proxy for global demand and supply relationships.
Including a measure of these relationships is beneficial with respect to fore-
casting volatility.
• Economic Policy Uncertainty Index for United States: An index re-
flecting economic policy uncertainty in the US, calculated based on news cov-
erage, tax code data, and economic forecaster disagreement. Because high
variance tends to correlate with high levels of uncertainty, this index is helpful
for forecasting volatility.
The above three data sets are downloaded from the St. Louis Federal Reserve FRED
website.
In addition, we include daily close data of the respective short-term VIX equivalent
when predicting variances for each index. Because the VIX and its equivalents are
measures of implied forward volatility rather than backward-looking realized vari-
ance, we believe each index’s VIX equivalent provides valuable insight into investors’
expectations of the current regime and thus is informative to our covariate set. VIX
equivalent data is downloaded using a Bloomberg Terminal.
To avoid overfitting our models, we do not include all of the VIX equivalents in
the covariate data set; we include only the one that applies to the specific index we
are forecasting for. For some indexes, their VIX equivalent had been discontinued
before April 30th, 2020 and so the data was not included.
Most indexes have relatively little missing data. When data is missing, we
forward fill from the previous day, reflecting both the clustering effect of volatility
and the persistence in volatility dynamics, as well as the fact that previous day
volatility is generally a good estimate of next day volatility. We do not back fill any
missing data to avoid look-ahead bias.
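As a minimal sketch of this fill rule (assuming the realized variances live in a pandas DataFrame `rv` indexed by date, with one column per index; the name is hypothetical):

```python
import pandas as pd

def fill_missing(rv: pd.DataFrame) -> pd.DataFrame:
    # Forward fill only: yesterday's variance is generally a good estimate
    # of today's, while back filling would leak future information into
    # the training set.
    return rv.ffill()
```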
However, STI specifically is missing data from January 9th, 2008 to September 18th,
2015. This is clearly too long of a period to forward fill data, and so we do not include
any data for STI between these dates. We only include STI in the covariate data set when we have data for the past 1000 trading days (the full size of our training
window) to fit the models.

2.2 ARCH/GARCH Methods


For the autoregressive models, we construct a data frame of daily returns from
our original data set. We consider open-to-close returns rather than close-to-close
returns because the realized variance data is also intraday (i.e., is calculated only
from open to close). We do not include any additional data because the models
themselves only take past returns as an input.
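A sketch of this return construction, assuming hypothetical DataFrames `open_px` and `close_px` of daily open and close prices aligned on dates:

```python
import numpy as np

# Open-to-close log returns, consistent with the intraday (open-to-close)
# realized variance data.
oc_returns = np.log(close_px / open_px)
```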

3 MODEL BUILDING
We fit each model on a rolling window of size 1000, including approximately four
years of data. A 1000-day window generally captures a breadth of market environments, including both high and low volatility periods, without reaching so far back that the data stops being indicative of the current regime; this addresses potential structural breaks in the sample. We only include indexes that have a full trailing data set, so most predictions start on November 25, 2003, 1000 days after the data set begins. However,
some indexes do not have a full window until much later. For example, BVLG has
no data before October 15th, 2012. For this reason, we do not include BVLG in our
covariate data set or provide predictions for BVLG until we have 1000 trailing data
points. We choose to forecast log(variance) and exponentiate our predictions rather
than forecasting variance directly due to the nonnegativity constraint of variance.
Due to the size of the data set and computing constraints, it is not feasible to fit the
models on the full data set. To manage this, we randomly sample 20% of the data to
include. We choose to sample randomly so as not to introduce any bias. For example, sampling every 5th point would mean that most data points are from the same day of the week. On each day, the lookback window comprises all the sampled data points that fall within the past 1000 days. We then randomly sample 50% of the
lookback window with which to fit our models. We choose to set a random seed of
580 so that each model provides predictions for the same dates.
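A sketch of this windowing scheme, assuming `full_data` is the covariate frame indexed by date (the seed and sampling fractions follow the text; expressing the trailing window in calendar days is a simplification):

```python
import pandas as pd

# Sample 20% of all observations once, with a fixed seed, so that every
# model is fit and evaluated on the same dates.
sampled = full_data.sample(frac=0.20, random_state=580).sort_index()

def fitting_set(t: pd.Timestamp) -> pd.DataFrame:
    # Lookback: all sampled points inside the trailing 1000-day window.
    window = sampled.loc[t - pd.Timedelta(days=1000): t]
    # Fit each model on a random half of the lookback window.
    return window.sample(frac=0.50, random_state=580)
```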
We analyze five horizons of forward variance predictions: one day, one week, two
weeks, three weeks, and one month. When computing out of sample forecasts for
30 days, we will use the model that performs the best on the horizon closest to
the number of days we are forecasting, for each forecast. For example, we will use
the model that was able to predict one day forward variance the best to predict
variance for one and two days out, and the model that performed best on a horizon
of five trading days for our predictions three to seven days out. We will also analyze performances separately during large drawdown periods to further address potential
nonlinearities caused by asymmetric responses to good and bad news.
The details of each model are outlined below.

3.1 Penalized Regression Methods


We consider Adaptive LASSO, Elastic Net, and SCAD. For these three models, we
select the hyperparameter lambda by the Bayesian Information Criterion (BIC) because BIC penalizes the number of selected features relatively heavily. The data set we are working with is highly correlated, and selecting by BIC helps avoid overfitting. We choose a
mixing parameter of 0.5 for Elastic Net (reflected as 1/3 in the code due to a detail
in Python’s Sci-Kit Learn) to reflect even mixing between L1 and L2 regularization.
We choose not to fit an intercept because the data is standardized before fitting
penalized regression models, and should theoretically pass through the origin for
any regression type model:
$$Y = \beta X + \alpha, \qquad E[\alpha] = E[Y] - E[\beta X] = 0 - 0 = 0$$
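A sketch of this setup (with `lam` a placeholder for the BIC-selected penalty, whose selection loop is omitted, and `X_train`, `y_train` assumed). The `l1_ratio = 1/3` reproduces the even mix because scikit-learn weights the penalty as alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2, which puts weight alpha/3 on each term:

```python
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Standardize the covariates and demean the target so that no intercept
# is needed, per the argument above.
X_std = StandardScaler().fit_transform(X_train)
y_std = y_train - y_train.mean()

lam = 0.01  # placeholder: the paper selects this by BIC
enet = ElasticNet(alpha=lam, l1_ratio=1 / 3, fit_intercept=False)
enet.fit(X_std, y_std)
```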

Our choices of lambda ranges for Elastic Net and LASSO are appropriate, as the models always select values in the middle of the range, as evidenced below in
Figures 1 and 2. Note that these are examples from one horizon but the same is
true for every horizon.

Figure 1: Elastic Net Lambda Choice vs. Time

Figure 2: Adaptive LASSO Lambda Choice vs. Time

3.2 Ensemble-Tree Methods


We also include a Gradient-Boosted Tree and Random Forest due to their higher
degree of robustness to outliers. We use the Huber loss function rather than the
least-squares loss to increase the robustness of the Gradient-Boosted Tree, leaving
the alpha-quantile as 0.9 as is standard in Sci-Kit Learn.
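A sketch of the two ensembles in scikit-learn (the tree counts are illustrative assumptions; only the Huber loss and its alpha-quantile follow the text):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Huber loss with scikit-learn's default alpha-quantile of 0.9 damps the
# influence of outliers on the boosted tree.
boosted_tree = GradientBoostingRegressor(loss="huber", alpha=0.9)
random_forest = RandomForestRegressor(n_estimators=500, random_state=580)

boosted_tree.fit(X_train, y_train)   # target is log(realized variance)
random_forest.fit(X_train, y_train)
```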

3.3 ARCH/GARCH Methods


We also consider a GARCH(1,1) model and a FIGARCH model. GARCH(1,1) is
fairly standard across the industry, and we include FIGARCH to allow for time
varying persistence in the volatility dynamics.
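Both specifications are available in Python's `arch` package; a minimal sketch, assuming `oc_returns` is a Series of open-to-close returns (scaled to percent for numerical stability):

```python
from arch import arch_model

garch = arch_model(100 * oc_returns, vol="GARCH", p=1, q=1).fit(disp="off")
figarch = arch_model(100 * oc_returns, vol="FIGARCH", p=1, q=1).fit(disp="off")

# One-step-ahead conditional variance, rescaled back from percent units.
next_var = garch.forecast(horizon=1).variance.iloc[-1, 0] / 100**2
```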

3.4 Benchmarks
We use two benchmarks to compare our predictions to:
• Random Walk: $\widehat{RV}_{t+1:t+h|t} = RV_t$
• Heterogeneous Autoregressive Model (HAR): $\widehat{RV}_{t+1:t+h|t} = \beta_0 + \beta_1 RV_t + \beta_2 RV_{w,t} + \beta_3 RV_{m,t}$, where $RV_{w,t}$ and $RV_{m,t}$ represent the average realized variance over the past week and month (a sketch of this regression follows below).
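A sketch of the HAR benchmark at the one-day horizon, assuming `rv` is a Series of daily realized variances for a single index:

```python
import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame({
    "rv_d": rv,                     # today's realized variance
    "rv_w": rv.rolling(5).mean(),   # average RV over the past week
    "rv_m": rv.rolling(22).mean(),  # average RV over the past month
})
y = rv.shift(-1)                    # next-day realized variance (h = 1)

data = pd.concat([X, y.rename("rv_next")], axis=1).dropna()
har = sm.OLS(data["rv_next"],
             sm.add_constant(data[["rv_d", "rv_w", "rv_m"]])).fit()
```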

4 MODEL EVALUATION
4.1 Mean/Median Squared Error
We first evaluate our models based on mean squared errors, calculating MSE sepa-
rately for each index and then averaging those values across indexes for each model
at each horizon (measured in trading days). Because we are investigating daily re-
alized variances and the values are quite small, we consider the mean squared error
of each model relative to that of the random walk model over the same horizon:

Figure 3: Relative Mean Squared Error (Overall)
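The scaling itself is straightforward; a sketch, assuming a hypothetical DataFrame `mse` of per-index mean squared errors with a (model, index) MultiIndex and one column per horizon:

```python
# Average each model's per-index MSE across indexes, then scale by the
# random walk's averaged MSE at the same horizon.
avg_mse = mse.groupby(level="model").mean()
rel_mse = avg_mse / avg_mse.loc["random_walk"]
```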

While our two ensemble tree methods perform quite well, we are extremely surprised
at the magnitude of the MSE for each of the penalized regression models relative
to the random walk. That is, the random walk model performs on average 6.75 × 10^8 times more accurately than the Adaptive LASSO in terms of squared loss. Taking a closer look, we realize that while the mean squared error for the Adaptive LASSO was on the order of 10^-6 for most indexes, the mean squared error for SMSI was on the order of 10^2.
We plot the Adaptive LASSO one day forward variance predictions and log variance
predictions for SMSI below, keeping in mind that we forecast log variance rather
than variance directly due to the nonnegativity constraint:

Figure 4: AdaLASSO 1 Day Forward Variance Predictions (SMSI)

Figure 5: AdaLASSO 1 Day Forward Log Variance Predictions (SMSI)

We realize that our penalized linear regression models are being thrown off by outliers occurring during the flash crash in August of 2016 (for several indexes, not just for
SMSI) to the point that they are sometimes predicting highly positive values for log
variance. This effect is amplified as we exponentiate our predictions. While only
one example is illustrated here, the same is true for at least one index for all three
penalized regression models across every horizon.
The two autoregressive models and the HAR model also severely underperform the
ensemble tree methods. This may be due to the fact that autoregressive methods
take only daily open-to-close returns as inputs and HAR takes only three data points.
This has two implications. Firstly, the machine learning models may perform better because they have access to a full set of covariates including all of the realized
variances as well as implied forward volatility and macroeconomic variables. Second,
there may be some difference arising from the fact that the realized variances that
we are forecasting are calculated intraday at 5 minute increments and are different
from the open-to-close volatility that autoregressive models predict.
Keeping all of the above in mind, we now investigate median squared error as well.
Median squared error is more robust than mean squared error, and so the few outliers
should have a less outsized effect. Again, we take the median for each index and
average across indexes for each model and horizon, scaling by the median squared
error of the random walk over the same horizon.

Figure 6: Relative Median Squared Error (Overall)

These figures seem much more reasonable. We notice that in both cases, either the
Boosted Tree or the Random Forest performs the most accurately, depending on the
horizon. The first case is unsurprising as we specifically chose those two methods
for their robustness to outliers, and further selected a loss function for the Boosted
Tree that does not weigh outliers as heavily as the least squares loss. The fact that
they also outperform in terms of median squared error reflects that the ensemble
methods are able to forecast variance extremely well.
In fact, only the two ensemble methods outperformed our benchmarks in terms of
mean squared error, while every model outperformed HAR and the five machine
learning models outperformed the random walk in terms of median squared error.
We also notice that our machine learning models all become more accurate relative
to the random walk over longer horizons, while the same is not true for autoregressive
models or HAR.
Now we move to investigate how each model performed under two different regimes:
high volatility and low volatility. Because we want to evaluate our models over
continuous stretches of time, we define a high volatility period as being during
a drawdown of 10+% in the S&P500 as well as the month immediately following.
Within our training sample, we identify four such periods: September 2008 to March 2009, August 2011 to November 2011, August 2015 to March 2016, and December 2018 to January 2019 (see code for methodology).

Figure 7: Relative Mean Squared Error (High Volatility Periods)

Figure 8: Relative Median Squared Error (High Volatility Periods)

We notice the same trends in high volatility periods that we observed while evalu-
ating our models overall. Note that here we are only interested in relative accuracy
between models; the absolute mean and median squared errors are higher during
high volatility periods, but the two ensemble tree methods still perform the best,
depending on the horizon we are forecasting for. For completeness, we also examine
how each model performed in low volatility periods:

Figure 9: Relative Mean Squared Error (Low Volatility Periods)

Figure 10: Relative Median Squared Error (Low Volatility Periods)

We see the same trends in low volatility environments as well.

4.2 Gaussianity Assumption


We make the assumption that returns are of the following form:

$$r_{t|t-1} = h_t\,\epsilon_t, \qquad \epsilon_t \sim N(0, 1) \quad (1)$$

Therefore, we calculate the Value-at-Risk measure as follows:

$$VaR_{\alpha,t+1} = c_\alpha \sqrt{\widehat{RV}_{t+1}} \quad (2)$$

Because we are studying indexes rather than individual stocks, we expect returns
to be somewhat Gaussian due to Aggregational Gaussianity and the central limit
theorem. However, it is important to understand whether assumption (1) is reflective
of reality and to what extent it is a restrictive assumption with direct consequences
in the VaR estimate as shown in (2). To do so, we study the standardized returns (t )
and compare them to a standard normal distribution. Assuming (1), we divide the
open-to-close returns by the respective realized volatility. We have chosen open-to-
close returns because our realized variance data does not account for overnight price
fluctuations. We perform a Jarque-Bera test to test the distribution of standardized
returns for normality, with the results outlined in the table below.

Jarque-Bera Test P-Values for Standardized Returns (%)

Index    P-Value    Index      P-Value
AEX      0.27       KS11       0.06
AORD     0          KSE        0
BFX      10.07      MXX        31.77
BSESN    0          N225       0
BVLG     17.34*     NSEI       0
BVSP     0          OMXC20     0.56
DJI      0          OMXHPI     0.17
FCHI     0.97       OMXSPI     0.01
FTMIB    3.68*      OSEAX      0
FTSE     0          RUT        0
GSPTSE   0          SPX        0
HSI      0          SSEC       0.16
IBEX     0.21       SSMI       5.13
IXIC     0          STI        31.96
STOXX50E 0.01

*Fewer than 2000 sample points; we recall that the Jarque-Bera test needs a larger sample to work properly.
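A sketch of the test for one index, assuming aligned Series `oc_returns` and `rv` of open-to-close returns and realized variances:

```python
import numpy as np
from scipy import stats

# Standardized returns: epsilon_t under assumption (1).
std_returns = (oc_returns / np.sqrt(rv)).dropna()
jb_stat, p_value = stats.jarque_bera(std_returns)
```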
At a 5% threshold, we reject the Gaussian assumption for most of the indexes. Our
standardized returns may either be skewed or have excess kurtosis, or both. For
our purposes we care exclusively about the left tail of the distribution, to inform
us whether applying the Gaussian quantile (cα in (2)) is justified in calculating
Value-at-Risk measures.
We look to a more visual representation where we can easily observe the behavior of
the left tail compared to the normal left-tail. Below are the QQ-Plots comparing the
empirical quantiles of standardized returns of each index to the Gaussian quantiles.
Please note that the QQ plots were produced using R, not Python.

Figure 11: Normal QQ-Plots of Standardized Returns
The standardized returns generally appear to have lighter tails than the Gaussian
distribution, the right tail especially so. This empirical fact is surprising at first,
since financial returns are usually quite heavy tailed. We can thus assume that the realized intraday volatility captures almost all of the magnitude of the deviations. The non-normality flagged by the Jarque-Bera test instead comes from the left-skewed shape
of the distribution. Since the normal distribution is symmetric, if the left tail is
well calibrated to the distribution, the right tail should be lighter than the one
represented by the Gaussian distribution.
We conclude that the assumption of Gaussianity is fair for the left tail of the distri-
bution, and therefore our calculation for Value-At-Risk estimates is appropriate.
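Concretely, the calculation in (2) reduces to a Gaussian quantile times the forecast volatility; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def value_at_risk(rv_forecast: float, alpha: float = 0.05) -> float:
    # c_alpha is the alpha-quantile of N(0, 1), e.g. about -1.645 at 5%.
    return norm.ppf(alpha) * np.sqrt(rv_forecast)
```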

4.3 Coverage Probability


We have analyzed the quality of the volatility forecasts of different models and
verified that the mathematical framework is adapted by studying the standardized
returns. We observed that the ML methods outperformed the Random Walk and
HAR (our benchmarks). However, choosing between the models at each horizon
is not straightforward, specifically between the Boosted Tree and Random Forest
whose accuracies are similar.

Lopez (1998) presents different methods to differentiate VaR methods [1]. He men-
tions two ways: hypothesis testing and minimizing a loss function. The former
allows for verifying characteristics that a VaR measure should have, including inde-
pendence and being violated indeed with only a probability α%. The latter defines
several loss functions that a good VaR estimate should minimize. Lopez (1998)
shows that the statistical tests have unsatisfying power, being unable to clearly re-
ject bad estimates. However, the loss function minimization method discriminates
better in almost every case [1].

In the paper, he asserts that the Magnitude Loss Function performs best overall,
allowing for evaluation also with respect to the magnitude of the return that goes
beyond the limit proposed by the Value-At-Risk. It is similar to the notion of
Expected Shortfall and we are thus not surprised that this loss function performs
best. The Magnitude Loss Function is defined as follows:
$$C_{m,t+1} = \begin{cases} 1 + (r_{t+1} - VaR_{m,t})^2 & \text{if } r_{t+1} < VaR_{m,t} \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
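A sketch of the loss computation, assuming aligned arrays of realized returns and VaR estimates (the VaR is a negative return level, so a violation is a return falling below it):

```python
import numpy as np

def magnitude_loss(returns: np.ndarray, var_estimates: np.ndarray) -> float:
    violated = returns < var_estimates
    # Each violation costs 1 plus the squared distance beyond the VaR.
    return float(np.sum(violated * (1.0 + (returns - var_estimates) ** 2)))
```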

Following the same modus operandi we did for squared errors, we calculate the value of the loss function for every method and every horizon relative to the performance
of the random walk model. We realize that if the volatility forecast is extremely
inaccurate, our VaR measure may be extremely conservative (for instance, when
the predicted variance is much higher than the real value) and the VaR would
not be violated. This would minimize the loss function even though the model was
inaccurate. For this reason, we choose to focus here only on the ML methods, having
seen that time series models like HAR, GARCH(1,1) and FIGARCH underperformed
the random walk. The results are presented below for VaR at 1% and 5%:

Figure 12: Relative Magnitude Loss Function values for VaR5%

Figure 13: Relative Magnitude Loss Function values for VaR1%

These results are less conclusive than those of the squared errors. We recall that
previously the ensemble methods outperformed the linear models. Therefore, even
if penalized linear models seem to provide a better Value-at-Risk measure for short
periods (1-day), we will consider the ensemble methods because the difference is not
significant. For longer horizons, however, we find again that the ensemble methods
perform best.
There are certain differences between the Boosted Tree and Random Forest depen-
dent on horizon, so it may be advantageous to combine them into an optimized
estimator. We will explore this option in the following sections.

4.4 Model Specification
Combining our model performances with respect to MSE, median squared error,
and coverage probability, we see that an optimized strategy combines the Random
Forest and Boosted Tree methods dependent on horizon. Noticing that the Boosted
Tree tends to outperform on longer horizons, we cautiously choose the Random
Forest for predictions up to 10 days out, and the Boosted Tree for predictions 11
to 22 days out. To reassure ourselves that this strategy is sensible and that we are
not making this decision based on insignificant differences in performance, we will
explore whether the predictive values from the Random Forest and Boosted Tree
are statistically different. To do so, we perform a Diebold-Mariano Test. We study
the significance of differences in predictive values not only overall, but also in high
volatility and low volatility periods, again defining high volatility periods as those
during a 10+% drawdown in the S&P 500 and the month immediately following.
The results are shown in the following table:

Diebold-Mariano Test P-Values: Random Forest and Boosted Tree (%)


INDEX 1 day 5 days 10 days 16 days 22 days
AEX 57.4 49.32 45.98 2.73 4.29
AORD 33.04 82.84 82.22 57.49 80.91
BFX 30.76 46.67 3.45 2.34 28.15
BSESN 27.46 32.61 75.69 91.61 88.02
BVLG 38.67 57.4 14.13 29.6 7.13
BVSP 37.07 17.04 4.21 82.93 15.59
DJI 29.17 44.48 50.49 28.41 39.35
FCHI 43.71 31.11 7.39 26.79 25.98
FTMIB 8.28 39.39 97.49 14.76 13.43
FTSE 21.58 58.48 14.31 21.63 1.89
GDAXI 3.96 2.94 44.54 3.71 28.89
GSPTSE 16.79 68.24 95.61 6.06 67.82
HSI 13.49 63.27 69.26 21.96 54.63
IBEX 5.36 29.48 1.75 1.21 2.69
IXIC 24.27 43.72 41.5 35.16 13.1
KS11 31.7 57.86 59.8 24.1 79.11
KSE 21.41 20.93 2.56 2.81 0.27
MXX 30.29 87.46 29.28 93.26 5.65
N225 52.68 38.04 39.06 77.95 14.1
NSEI 22.87 41.67 12.77 13.2 30.96
OMXC20 80.61 65.82 80.62 55.9 19.63
OMXHPI 31.58 24.71 22.05 11.53 28.84
OMXSPI 34.59 13.95 13.17 49.78 31.38
OSEAX 17.8 33.47 8.78 32.15 85.88
RUT 95.77 54.6 76.89 45.2 44.69
SMSI 14.8 86.27 16.24 1.85 12.33
SPX 87.62 79.05 73.46 23.13 49.21
SSEC 1.73 47.73 57.97 63.36 73.29
SSMI 62.18 84.22 3.07 4.44 2.39
STI 20.98 93.15 78.31 51.13 23.85
STOXX50E 9.38 91.82 35.57 26.06 7.03

*Shading in the original table marks significance: dark green 5%, green 10%, light green 15%.
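A minimal sketch of the underlying statistic, assuming `err_rf` and `err_bt` are arrays of forecast errors from the two models at a given horizon (the one-step form; longer horizons would need a long-run variance correction, omitted here):

```python
import numpy as np
from scipy.stats import norm

def dm_pvalue(err_rf: np.ndarray, err_bt: np.ndarray) -> float:
    d = err_rf**2 - err_bt**2                   # squared-loss differential
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
    return 2 * (1 - norm.cdf(abs(dm)))          # two-sided p-value
```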
Certainly, the results are not always statistically significantly different from each other. However, we can see clearly that for horizons of 16 days and longer, the pre-
dictions of the Random Forest and Boosted Tree differ substantially. At a horizon
of 1 day, we still do have statistical differences for some indexes at different confi-
dence levels. Therefore, selectively switching from Random Forest to Boosted Tree
may increase the out-of-sample accuracy compared to a full Random Forest or a full
Boosted Tree strategy.
Our goal was to further separate the model into one that performed well during
drawdowns (high volatility) and one that performed well when markets were calm
in order to further address the asymmetric market response to positive and negative
events. However, given the MSE results, we do not see any improvement from
changing the algorithm according to the macroeconomic environment. We suggest
another method to account for this difference in the Further Improvements section.
Finally, for the out of sample predictions during the COVID-19 pandemic, we will also include daily new diagnosed COVID cases in our covariates. As we did with the VIX equivalent indexes, we will only add data for the respective country whose index we are forecasting. We believe that in this uncertain period, markets partially react to changes in future expectations of COVID cases, and so we use the first and second differences in infections rather than the infection numbers themselves.
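A sketch of this construction, assuming `cases` is a Series of cumulative diagnosed cases for the country whose index is being forecast:

```python
import pandas as pd

covid_features = pd.DataFrame({
    "cases_d1": cases.diff(),         # first difference: new daily cases
    "cases_d2": cases.diff().diff(),  # second difference: change in new cases
})
```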

4.5 Feature Importances


Now, we investigate the features that the ensemble-tree models deemed most im-
portant. While this is not a factor in our model selection, it is useful for our under-
standing of which covariates may have the most predictive power. For our purposes,
we will investigate the predictive feature importances of the S&P 500 and the Nikkei
225, considering the three features of highest importance at all lags. Note that we
included the VIX as a covariate for the S&P 500 but did not include an equivalent
for Nikkei because one was not available.
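The importances reported below can be read directly off the fitted scikit-learn ensembles; a sketch, assuming the covariate names label the columns of `X_train`:

```python
import pandas as pd

importances = pd.Series(random_forest.feature_importances_,
                        index=X_train.columns)
top_three = importances.nlargest(3)   # the three most important covariates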

We outline feature importances for S&P 500 variance below. BT1, BT2, and BT3
refer to the first, second, and third most important features for the Boosted Tree,
and RF1, RF2, and RF3 refer to the same for the Random Forest.

Horizon (Days) | BT1 | BT2 | BT3
1 | VIX (0.37) | IXIC 1 day (0.08) | SPX 5 day (0.07)
5 | VIX (0.30) | IXIC 5 day (0.04) | IXIC 1 day (0.03)
10 | VIX (0.32) | OMXHPI 5 day (0.04) | IBEX 22 day (0.04)
16 | VIX (0.21) | IBEX 22 day (0.05) | OMXHPI 22 day (0.04)
22 | VIX (0.13) | IBEX 22 day (0.07) | 10Y/2Y spread (0.06)

Horizon (Days) | RF1 | RF2 | RF3
1 | VIX (0.30) | SPX 1 day (0.08) | SPX 5 day (0.08)
5 | VIX (0.35) | SPX 5 day (0.05) | IXIC 1 day (0.04)
10 | VIX (0.26) | SPX 5 day (0.04) | OMXHPI 5 day (0.03)
16 | VIX (0.18) | GSPTSE 22 day (0.06) | IBEX 22 day (0.05)
22 | VIX (0.10) | GSPTSE 22 day (0.07) | IBEX 22 day (0.07)

S&P 500 Feature Importances for Boosted Tree (BT) and Random Forest (RF)
We notice a few important trends here. First, we see that the VIX is always the
most important predictive feature. This makes intuitive sense to us, as the VIX
is the only forward-looking implied covariate included in our data set. The VIX
has outsized importance on shorter term horizons (approximately 4-7x as much as
the next most important covariate), but less importance on longer time horizons
compared to other covariates. Although not illustrated here, most covariates other
than the VIX have relatively similar feature importances.
Another interesting point to note is that longer term historical variances become
more important for forecasting longer term horizons, though this effect is relatively
small. The Random Forest tends to choose historical S&P 500 variances as an important feature more often than the Boosted Tree. This is likely simply because
the S&P 500 is heavily correlated with most global stock indexes and may also result
from differences in the boosting and bagging algorithms, given that Boosted Trees
are more prone to overfitting.
Below, we present the same table for the Nikkei 225.
Horizon (Days) | BT1 | BT2 | BT3
1 | N225 1 day (0.33) | N225 5 day (0.22) | N225 22 day (0.06)
5 | N225 5 day (0.22) | N225 22 day (0.13) | N225 1 day (0.13)
10 | N225 5 day (0.17) | N225 22 day (0.13) | N225 1 day (0.06)
16 | N225 22 day (0.13) | N225 5 day (0.11) | GSPTSE 22 day (0.05)
22 | N225 22 day (0.13) | GSPTSE 22 day (0.07) | N225 5 day (0.05)

Horizon (Days) | RF1 | RF2 | RF3
1 | N225 1 day (0.28) | N225 5 day (0.23) | N225 22 day (0.03)
5 | N225 5 day (0.25) | N225 1 day (0.10) | N225 22 day (0.09)
10 | N225 5 day (0.18) | N225 22 day (0.10) | GSPTSE 22 day (0.06)
16 | N225 22 day (0.12) | GSPTSE 22 day (0.10) | N225 5 day (0.10)
22 | N225 22 day (0.14) | GSPTSE 22 day (0.12) | Brent (0.06)
Nikkei 225 Feature Importances for Boosted Tree (BT) and Random Forest (RF)
The ensemble models tend to find that the past variances of the Nikkei at every
lag are most important for predicting future variance. This difference stems from
the fact that the Nikkei is less correlated with other global stock indexes than the
S&P. In addition, we again notice that longer term historical variances are more
important for predicting longer horizons.
The biggest difference between the Nikkei and S&P feature importances is that the
Nikkei does not have one feature that dominates the rest like the VIX for the S&P.
Even though the ensemble models tend to pick the same covariates for each horizon,
the magnitudes of their importance are not significantly larger than other covariates.

In fact, this trend was seen across all of the stock indexes: when a VIX equivalent
was available as a covariate, the ensemble models generally found that it was by far
the most important feature.
From these two tables, we can see clearly the effects of volatility spillover: cross-correlation and lead-lag effects among different markets, whereby realized volatility in one market carries over into volatility in another (for example, if S&P 500 variance spikes one day, then Nikkei 225 variance will spike during its market hours that day or the next day). We can
see that this effect is much larger for the S&P than for the Nikkei, as mentioned
above. We accounted for this effect by including the short, medium, and long term
historical volatilities of all the other indexes in the feature set for predicting each
index. Any significant coefficient attached to a different index’s historical variance
indicates some level of cross-correlation/spillover. It can be interpreted as a lead-lag
effect because one variable’s historical value (lag) is correlated with the dependent
variable’s present value (lead).

5 PREDICTIONS
5.1 Out of Sample Testing (Feb. 1 to Apr. 30)
Now that we have defined our model, we forecast values during our testing sample
of February 1st, 2020, to April 30th, 2020. For illustrative purposes, the one day
forward, one week forward, and one month forward forecasts for the S&P 500 are
plotted below against the actual realized variance values.

Figure 14: Predicted and Actual One Day Forward S&P 500 Variance

Figure 15: Predicted and Actual One Week Forward S&P 500 Variance

Figure 16: Predicted and Actual One Month Forward S&P 500 Variance

The model seems to capture variance moves fairly well, though the pattern seems
clearest on a horizon of one week. However, we do notice that we tend to underes-
timate large spikes, most clearly so on a horizon of one month. This is due to the
fact that we are predicting forward variance: a large variance event one month in
the future would be recognized by the actual one month forward variance, but the
model would have no idea until after the event has passed and it has seen the vari-
ance data. This explains why the actual one month forward variance is significantly
higher than the estimate from mid-February until mid-March, when a large variance
event occurred (people and markets began to grasp the seriousness of the disease).
To further explore this, we present both the mean and median squared errors below.
Once again, because the raw errors are difficult to interpret on their own, we scale
them by the respective mean or median squared error of the random walk. Under
the same modus operandi as when we evaluated our models, we take the mean or
median error for each index and average across indexes on each horizon.

Again, we notice that the model tends to outperform the random walk to a greater
extent at longer horizons, both in terms of mean and median squared error. It is
interesting to note that the model does not outperform on a one day horizon to the same extent as either ensemble method did in our evaluation sample, both overall and during high volatility periods. Because we are measuring error relative to the random walk,
this suggests a very strong volatility clustering effect because historical variance is
a good predictor. This also may signal a regime shift, which makes sense given the
current macroeconomic environment and the drastic change in market sentiment
once COVID cases started to ramp up and people began to grasp the seriousness of
the disease. In retrospect, we could have chosen a smaller trailing window size than
was used during the evaluation of our model.

5.2 VaR
Now, we discuss the performance of our VaR estimate as a risk measure. More
precisely, we aim to verify that the VaR measure is truly violated only α% of the
time as well as demonstrate its independence. The table below shows the number
of times the return was below the VaR during the out of sample testing. We recall
that we expect the VaR1% to be violated between 0 and 1 time and the VaR5% to be violated between 3 and 4 times (the testing sample spans roughly 63 trading days, so violation rates of 1% and 5% correspond to about 0.6 and 3.2 violations).
Violations of the VaR5% and VaR1%
Index | VaR5%: 1 day, 5 days, 22 days | VaR1%: 1 day, 5 days, 22 days
AEX 3 2 1 1 0 0
AORD 4 6 2 2 2 0
BFX 2 3 0 1 0 0
BSESN 2 0 0 1 0 0
BVLG 4 1 0 0 0 0
BVSP 3 0 0 1 0 0
DJI 2 3 0 0 1 0
FCHI 0 0 0 0 0 0
FTMIB 3 2 0 1 1 0
FTSE 2 0 0 0 0 0
GDAXI 3 1 0 2 0 0
GSPTSE 0 5 1 0 0 0
HSI 2 1 0 1 0 0
IBEX 1 0 0 1 0 0
IXIC 2 3 1 1 2 0
KS11 4 0 0 1 0 0
KSE 8 2 0 4 1 0
MXX 5 1 0 2 0 0
N225 4 2 0 1 0 0
NSEI 2 0 0 1 0 0
OMXC20 3 9 7 2 1 4
OMXHPI 4 3 0 2 0 0
OMXSPI 4 0 0 3 0 0
OSEAX 5 0 0 0 0 0
RUT 1 0 0 0 0 0
SMSI 1 1 0 0 0 0
SPX 1 3 0 0 0 0
SSEC 5 6 12 2 1 7
SSMI 2 1 0 1 0 0
STI 6 1 0 2 0 0
STOXX50E 4 1 0 1 0 0
AVERAGE 2.9 1.8 0.7 1.1 0.3 0.4

The VaR measure works as expected, except for OMXC20 and SSEC. We show
below the time series of their respective VaR.

Figure 17: OMXC20 weekly returns, VaR5% and VaR1%

Figure 18: SSEC weekly returns, VaR5% and VaR1%

We can see from the figures above that the VaR measures for the OMXC20 and SSEC were each violated on only two separate occasions, but for stretches longer than one day, which inflates the violation counts. Therefore, the number of independent violations remains within the expected range. However, in these two
cases, the VaR estimates did not react quickly enough to large negative returns and
remained too small for a period.
Our observations are much better for the rest of the indexes, however. For further
qualitative analysis, we are going to focus specifically on the 1-day VaR for the S&P
500 and the Nikkei 225.

(a) Time Series (b) Violations

Figure 19: SPX weekly returns, VaR5% and VaR1%, and Violations

(a) Time Series (b) Violations

Figure 20: N225 weekly returns, VaR5% and VaR1%, and Violations

We notice a few interesting and desirable features of the VaR estimate. First of
all, the VaR estimate tends to react very quickly to larger deviations of returns.
Additionally, the estimate restores itself to a normal environment as soon as returns
become less volatile. It is obvious why the first point is important, but the second
is also extremely desirable: in the financial industry, the Value at Risk measure
controls the volumes and the gross exposure of any transactions a hedge fund takes
part in. Having a VaR estimate remain undesirably high for a long period after a
shock can potentially lead to smaller returns for a fund.
We can also see in the table that VaR measures over longer horizons are violated less often than those for shorter horizons. This is a consequence of the central limit theorem, which implies that returns over longer periods have lighter tails. There-
fore, the estimate becomes more conservative.
We conclude that the VaR estimates are satisfactory overall and provide very rea-
sonable measures of risk, even in this turbulent environment.

5.3 Feature Importances


We will again explore the feature importances for the S&P 500 and Nikkei 225 at
all lags, beginning with the S&P:
Horizon (Days) | Feature 1 | Feature 2 | Feature 3
1 | VIX (0.44) | SPX 1 day (0.13) | DJI 1 day (0.07)
5 | VIX (0.53) | IXIC 1 day (0.04) | DJI 1 day (0.03)
22 | VIX (0.15) | US COVID Cases (0.14) | IXIC 22 day (0.07)
S&P 500 Model Feature Importances
We see some similar trends here to our analysis during model evaluation. The
model finds that the VIX is always the most important covariate. However, the
importance is now most pronounced on a horizon of one week, with its relative
importance decreasing as we increase the predictive horizon to one month.
We also notice two additional clear differences. First, on horizons of one and five
days, the model includes the DJI, another US index. This indicates that the variance
of the S&P 500 was either more highly correlated with other US indexes or less highly
correlated with global indexes during this period. We also notice that our additional
covariate, COVID-19 infection data, did not add much value on short horizons, but
was approximately equally as important as the VIX for predicting variance on a
horizon of one month. This suggests that COVID data has implications for market variance over longer horizons, not just the next few days.
Below, we present the same table for the Nikkei 225:
Horizon (Days) | Feature 1 | Feature 2 | Feature 3
1 | N225 5 day (0.34) | N225 1 day (0.12) | SPX 1 day (0.02)
5 | N225 5 day (0.29) | N225 1 day (0.13) | OMXC20 22 day (0.04)
22 | FTSE 22 day (0.22) | Japan COVID Cases (0.10) | MXX 22 day (0.07)
Nikkei 225 Model Feature Importances
Again, we notice some similar trends. The model finds that past Nikkei variance is
the most important predictor for future Nikkei variance on shorter horizons, with
more evidence of volatility spillover on a horizon of one month. We again notice that
COVID-19 cases in Japan become an important predictor on a one month horizon,
again suggesting the long term impact of the disease on market variance.

5.4 True Out of Sample (May 1 to May 22)


We apply a similar, though slightly altered methodology for our true out of sample
predictions. Our testing predictions were predicated on having trading data up
until the current date to make predictions for cumulative variance over the next
day, week, and month. Now, our goal is to forecast daily realized variances. To do
so, we consider two methods:
• Method 1: Forecast cumulative variances each day from two to 17 days for-
ward, again using a Random Forest for days 2-10 and a Boosted Tree thereafter.
Because we only have data available up until April 29th, we need to forecast
T+2 to begin our predictions on May 1st, through T+17 for May 22nd. Be-
cause we are forecasting cumulative variances for each day, we will subtract
the predicted total up until that date for each forecast. In the unlikely but
possible scenario that this gives us a negative value for variance on a certain
date, we will instead divide the forecasted cumulative sum by the total number
of days forward and use the average predictive daily variance.
• Method 2: Because our objective is to minimize mean squared error over the
whole sample, we will simply project the cumulative variance from May 1st to
May 22nd and use the daily average for each day’s projection. Both methods are sketched below.
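A sketch of the two schemes, assuming a hypothetical dict `cum_pred` mapping each horizon h (in trading days ahead) to the predicted cumulative variance over days 1 through h:

```python
def method1_daily(cum_pred: dict) -> dict:
    # Difference successive cumulative forecasts to get per-day variance;
    # if a difference goes negative, fall back to the running average.
    daily, prev = {}, 0.0
    for h in sorted(cum_pred):
        diff = cum_pred[h] - prev
        daily[h] = diff if diff > 0 else cum_pred[h] / h
        prev = cum_pred[h]
    return daily

def method2_daily(cum_pred: dict, last_h: int) -> float:
    # Spread the full-window cumulative forecast evenly across its days.
    return cum_pred[last_h] / last_h
```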
In order to test which of these two methods performs better on our relatively small
sample of data during the COVID crisis, we perform a backtest and measure results
for the months of February, March, and April. The results are presented in the table
below.
Month (2020) February March April
Method 1 Sq. Error 4.0E-9 4.3E-6 2.6E-7
Method 2 Sq. Error 2.6E-9 1.7E-7 3.3E-8
M1/M2 Sq. Error Ratio 1.53 25.94 7.73
Out of sample mean squared errors of different methods
We see that method 2 clearly outperforms and so we will forecast the cumulative
variance from May 1st to May 22nd and use the average daily variance for each day's forecast. For an illustration of why this is the case, let us imagine that we
had only two points in a different setting: one method forecasted one point perfectly
but mispredicted the second point by 10, and another method mispredicted both
points by 7. In this case, the second method would have a lower mean squared
error. Because our models are trained to predict cumulative variance, to minimize
MSE while predicting daily variances more than one day in advance, we should take
the average of the full window’s cumulative predicted variance for each day (see
associated Python code and Excel spreadsheet for forecasts).

6 FURTHER IMPROVEMENTS
We identify the following areas for further improvement:
• Add a variable to indicate the current regime. We intended to address the mar-
ket’s asymmetric response to good and bad news by splitting up the model into
one that outperformed during large drawdown periods and one that outper-
formed during other periods, but found that the same models always performed
best. While the inclusion of the 10-year/2-year treasury spread as a covariate
partially managed this concern, adding a column of 1's and -1's would allow the regime to be reflected further in the data set itself. The high volatility regime could also be defined by whether variance is more than two standard deviations above its one year moving average (see the sketch after this list).
• Test models during the early stages of the COVID-19 crisis, or specifically
test which models performed most accurately during past pandemics using
hospitalization data. Because COVID-19 did not exist for the large majority
of our training sample, there is a chance that another model would capture
the relationship between COVID data and market variance more accurately.
• Use a smaller lookback window for our out of sample predictions to reflect the
swift change in market sentiment.
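A sketch of the proposed regime flag, assuming `rv` is a Series of daily realized variances (with 252 trading days standing in for one year):

```python
import numpy as np

mean_1y = rv.rolling(252).mean()
std_1y = rv.rolling(252).std()
# +1 in the high volatility regime, -1 otherwise.
regime = np.where(rv > mean_1y + 2 * std_1y, 1, -1)
```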

7 CONCLUSION
Even putting aside the humanitarian and social effects and focusing only on the
economic, this crisis is not the same as the other large drawdowns we’ve seen in the
past two decades. The market swings caused by the COVID-19 crisis have come more swiftly than any in recent memory and have hit at least as hard. We have demonstrated
that COVID data has long term implications on market variance across the globe.
This translates into extreme market uncertainty, both for portfolio managers and
individual investors.

However, we have illustrated that a horizon-dependent combination of ensemble
methods can provide reasonable variance estimates as well as reliable risk metrics
in both the short and longer term. These methods strongly outperform our bench-
marks, the random walk model and the heterogeneous autoregressive model, as well
as penalized regression methods and traditional autoregressive econometric models.
The associated risk metrics are reliable and independent, and should calm investors’
nerves in this turbulent period and assist in making investment decisions.

8 APPENDIX
8.1 List of Indexes
AEX, AORD, BFX, BSESN, BVLG, BVSP, DJI, FCHI, FTMIB, FTSE, GDAXI, GSPTSE, HSI, IBEX, IXIC, KS11, KSE, MXX, N225, NSEI, OMXC20, OMXHPI, OMXSPI, OSEAX, RUT, SMSI, SPX, SSEC, SSMI, STI, STOXX50E

8.2 List of VIX Equivalents


VAEX, AS51VIX, VXD, VCAC, IVMIB30, IVUKX30, VHSI, VIBEX, VXN, VKOSPI,
RXV, VIX, VXEFA, VXEEM, VXFXI, VXEWZ

8.3 Code
The associated Python code is attached in PDF form.

References
[1] Jose A. Lopez. "Methods For Evaluating Value-At-Risk Estimates". In: Federal Reserve Bank of New York Research Paper No. 9802 (1998). URL: https://www.newyorkfed.org/medialibrary/media/research/staff_reports/research_papers/9802.pdf
