ARTICLE INFO

Keywords: Transformer model; Asset allocation; SHAP; Chinese stock market

ABSTRACT

Deep learning technology is rapidly adopted in financial market settings. Using a large data set from the Chinese stock market, we propose a return-risk trade-off strategy via a new transformer model. The empirical findings show that these updates, such as the self-attention mechanism in technology, can improve the use of time-series information related to returns and volatility, increase predictability, and capture more economic gains than other nonlinear models, such as LSTM. Our model employs Shapley additive explanations (SHAP) to measure the “economic feature importance” and tabulates the different important features in the prediction process. Finally, we document several economic explanations for the TF model. This paper sheds light on the burgeoning field on asset allocation in the age of big data.
1. Introduction

In recent years, new predictive deep learning techniques have been rapidly updated and adopted in a wide range of industries to improve profitability. These applications have raised two questions: (1) whether a successful model in one domain is suitable for another complex environment, such as the financial market, and (2) why the new technique performs better than traditional models and how this is achieved. In this paper, we study asset allocations with newly developed deep learning techniques in financial markets. We do so by developing an interpretable transformer framework, conducting empirical analysis on a large data set from the Chinese stock market, and undertaking the economic ground of deep learning in portfolio selection.

There is a large body of literature on asset allocations with expected return and volatility, such as optimal stock-to-cash allocation (Kandel & Stambaugh, 1996), stock index allocation (Marquering & Verbeek, 2004) and portfolio efficiency (Ferson, Siegel, & Wang, 2019; Moreira & Muir, 2017). However, those studies implicitly assume linear relations in the prediction process and lack suitable frameworks in a large data set. To address the shortcomings, this paper tracks Pinelis and Ruppert (2022), who applied machine learning to the allocation of indices and risk-free assets for the first time, and introduces an interpretable transformer model (TF) as a new asset pricing framework. The TF model is a novel powerful deep learning method developed by Vaswani et al. (2017), and it is a recently proposed model for dealing with nonlinear data.

Notably, the self-attention mechanism improves the performance of the TF. It can observe all the fundamental information of the stock at the same time and then filter the key information, just as human eyes can quickly scan the images and obtain the areas that need to be focused. Therefore, the TF model is more likely to select the important and critical information from fundamental and macro signals, and the final forecasting results are closer to the real return or volatility distributions than the traditional neural network model.

For better model training, the data sample in our paper combines 72 firm characteristics with 8 macro features following Jiang, Tang, and Zhou (2018) and Fisher, Martineau, and Sheng (2022), covering all Chinese A-share stocks from January 2000 to December 2019, and excludes the bottom 20% of stocks in terms of firm size at the beginning of each sample year to minimize the small firm effect and shell-value contamination (Liu, Stambaugh, & Yuan, 2019). The average monthly number of firms is 1824, and their average monthly return is 1.17%.

Several empirical findings are worth noting. First, in terms of predictive power, the TF model performs significantly better than traditional models in predicting stock return and volatility. The traditional machine learning model has the highest R2 in return forecasting at 1%. In the TF model, the best performing R2 is 2.1%. In terms of volatility prediction, the best performing traditional machine learning
* Corresponding author at: School of Economics, Minzu University of China, Beijing, China.
E-mail address: mark8938@qq.com (T. Ma).
https://doi.org/10.1016/j.irfa.2023.102876
Received 2 February 2023; Accepted 1 August 2023
Available online 3 August 2023
1057-5219/© 2023 Elsevier Inc. All rights reserved.
T. Ma et al. International Review of Financial Analysis 90 (2023) 102876
is the elastic net with R2 = 92.21%, while the TF model (encoder = 1–5) predicts an R2 between 96.77% and 98.36%. The higher improvement in volatility forecasting highlights the advantage of the TF model in time-series autocorrelation prediction. In addition, the largest Sharpe ratio using the TF model is 2.75, which is also higher than other machine learning models, demonstrating the success of the TF model in the emerging market.

Second, focusing on model interpretation, we propose the Shapley additive explanations (SHAP) value (Lundberg & Lee, 2017) to calculate the importance of each feature in the model. The SHAP values method is a feature attribution method that infers how a set of features and the process of prediction are related, which is helpful for interpreting the model performance. In terms of the SHAP value in return prediction, fundamental signals such as earnings yield and gross margins are extremely important, indicating that firm quality has a great impact on stock returns. In contrast, the important features in stock volatility forecasting are more associated with the macro environment, such as bond yields, net exports and GDP. The difference in feature importance highlights the necessity of return and volatility forecasting separately (Fabian & Marcel, 2022).

Finally, we conduct several analyses to exploit the economic ground of the TF model. First, in additional portfolio-level analysis, using fundamental and macro signals with a self-attention mechanism, the TF model shows better performance on large and value firms. Second, in terms of prediction error, we find that the prediction error of the TF model is insignificant (of the 72 portfolio samples, only 4 were significant), while the error of the traditional model such as LSTM is largely significant (of the 72 portfolio samples, 34 were significant) with the value of error close to the level of characteristic-managed portfolio return. This suggests that the expected returns predicted by the TF model are closer to the true returns of portfolios and stocks than traditional deep learning models, which may be due to the unique self-attention mechanism being able to better detect the time-series relations in market anomalies. Finally, the TF model outperforms in intraindustry, indicating that our model focuses more on stock selection than industry rotation.

Our study has several contributions. First, this paper extends the use of machine learning with big data for asset allocation and asset pricing (Chen, Pelger, & Zhu, 2021; Giglio, Kelly, & Xiu, 2022; Gu, Kelly, & Xiu, 2020; Leippold, Wang, & Zhou, 2022; Pinelis & Ruppert, 2022; Ma, Leong, & Jiang, 2023). As prime examples, Pinelis and Ruppert (2022) introduce a utility maximizing investment strategy using random forest to predict the return and volatility. Our study takes the lead to introduce the TF model as the pricing model. Compared with traditional machine learning models, the TF model has the characteristics of self-attention, which increases its prediction performance. Second, we introduce the interpretable mechanism to the deep learning frame and show the economic grounds of the machine learning-based pricing model. Recent studies highlight the interpretable machine learning model on the financial market (e.g., Fan, Ke, Liao, & Neuhierl, 2022; Kapetanios & Kempf, 2022). We extend the research to use SHAP values to visualize feature contributions and find the different feature importance in return and volatility prediction. Finally, our paper echoes asset pricing studies and extends the economic explanations of machine learning in the Chinese stock market (Leippold et al., 2022; Liu et al., 2019; Ma, Liao, & Jiang, 2023).

The rest of this paper is organized as follows. Section 2 describes the data and method, Section 3 presents the empirical findings, Section 4 delineates the economic ground, and Section 5 summarizes the full paper.

2. Methodology

2.1. Transformer model algorithms¹

The transformer model (TF), a type of unsupervised deep learning, has been widely used in various areas of artificial intelligence in recent years. A typical TF consists of two modules, an encoder and a decoder. Many variants are subsequently derived, such as Realformer, Performer, Lazyformer, Bidirectional Encoder Representation from TFs (BERT), etc. (Lin, Wang, Liu, et al., 2021). Following Vaswani et al. (2017), we construct the TF model for financial asset allocation by modifying the model according to its encoder part. As shown in Fig. 1, our model includes an input embedding layer, three encoders, and a dense layer. The equation of the model is as follows:

$$ r_t = f_t\left(z_{i,t-1} : \theta\right) + \varepsilon_{i,t}, \tag{1} $$

$$ \sigma_t = f_t\left(z_{i,t-1} : \theta\right) + \epsilon_{i,t}, \tag{2} $$

where $z_{i,t-1}$ is the vector of the firm and macro factors. $f_t$ represents the TF deep learning model taking into account the nonlinear relationship between variables. (See Figs. 2–4.)

2.1.1. Input embedding layer

Following Kazemi et al. (2019), we propose Time2Vec, a representation for time that has three identified properties to embed the stock features onto a higher D-dimensional space. For a given scalar notion of time t, Time2Vec of t, denoted as t2v(t), is a vector of size k + 1 defined as follows:

$$ \mathrm{t2v}(t)[i] = \begin{cases} \omega_i t + \varphi_i, & \text{if } i = 0 \\ \sin(\omega_i t + \varphi_i), & \text{if } 1 \le i \le k \end{cases} \tag{3} $$

where t2v(t)[i] is the ith feature of t2v(t), sin is the sine periodic activation function, and the $\omega_i$s and $\varphi_i$s are learnable parameters. The linear term

Fig. 1. TF model.
This figure shows the internal structure of our constructed TF model, which consists of 1 embedding layer, several encoder and hidden layers, and 1 output layer. The input data are n features of P stocks.
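The Time2Vec mapping in Eq. (3) can be illustrated with a minimal sketch. It assumes the frequencies and phases are given as plain lists (in the model they would be learned during training); the function name `time2vec` and the numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def time2vec(t: float, omega: Sequence[float], phi: Sequence[float]) -> List[float]:
    """Eq. (3): element 0 is linear in t, elements 1..k are sinusoidal in t."""
    if len(omega) != len(phi) or len(omega) == 0:
        raise ValueError("omega and phi must be non-empty and of equal length (k + 1)")
    # i = 0: linear term omega_0 * t + phi_0
    linear = omega[0] * t + phi[0]
    # 1 <= i <= k: periodic terms sin(omega_i * t + phi_i)
    periodic = [math.sin(w * t + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


# Illustrative (not fitted) parameters with k = 2, so the output has size k + 1 = 3.
omega = [0.5, 1.0, 2.0]
phi = [0.1, 0.0, math.pi / 2]
embedding = time2vec(3.0, omega, phi)
```

The first element grows without bound as t increases, while the remaining elements stay in [-1, 1], which is what lets the embedding carry both trend-like and periodic information about time.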
¹ We tabulate the hyperparameter settings of our models in the appendix (Table OA3).
each vector of z. Moreover, considering that the multihead attention mechanism may not fit complex processes well enough, the encoder is enhanced by adding two layers of convolutions of kernel size 1.

2.1.3. Output layer

First, the features passing through the encoder layer are subjected to global average pooling (GAP), which flattens the remaining dimensions to reshape the tensor. After that, these features are sent to two fully connected layers to output the predicted results.

The first dense layer is as follows:

$$ Z_k^{l} = g\left(b^{l-1} + \left(Z^{l-1}\right)' W^{l-1}\right), \tag{7} $$

where g(•) is the nonlinear “activation function” that takes the aggregated signal from the previous layer and sends it to the next layer.

We apply the rectified linear unit (ReLU) as the nonlinear activation function:

$$ \mathrm{ReLU}(Z_k) = \max(Z_k, 0). \tag{8} $$

The final output is a linear transformation of the last dense layer output:

$$ D(Z, b, W) = b^{l-1} + \left(Z^{l-1}\right)' W^{l-1}. \tag{9} $$

We apply the first dense layer with 64 neurons. Moreover, to avoid overfitting, we add one dropout layer disabling a portion of neurons between the two dense layers with a 0.1 dropout rate.

The gradient descent algorithm is generally used to optimize the objective function. For a given function L(θ), the algorithm minimizes L(θ) by updating θ in the opposite direction of the gradient, i.e., $\theta = \theta - \eta \nabla_\theta L(\theta)$, where η is the step size, and we use the stochastic gradient descent (SGD) method to optimize the TF model.

In our main empirical analysis, we use the TF model to predict two things: the first is to predict the value-weighted market portfolio’s return based on firm-specific characteristics and macro factors, and the second is to apply the same method to predict the portfolio’s volatility. To pursue the economic grounds of the TF model and robust analysis, we also consider the prediction on characteristic-managed portfolios and industry portfolios in Section 4.

For comparison, we include principal component analysis (PCA), elastic net (Enet), random forest (RF) and long short-term memory with three layers (LSTM), with their algorithms displayed in Bali, Goyal, Huang, Jiang, and Wen (2022).² In particular, LSTM has a similar attention unit as the TF model, named the memory unit, which is a good

characteristic indicators: valuation and growth, investment, earnings, inertia, trading friction and intangible assets.³ Additionally, we also calculate the lag returns and volatilities with 1 and 2 months considering the autocorrelation in variables (Pinelis & Ruppert, 2022).

To investigate whether macro factors affect asset allocation, we select an additional 8 macro factors to add to our data set, resulting in a total of 80 variables in our data set. For the market portfolio used as a risk asset in our paper, we assign different weights to each stock based on its market size, thus constructing a portfolio whose β is 1, which means that its systematic risk is equal to the systematic risk of the market.

2.3. Asset allocation

When we weight assets between risky and risk-free assets, we use a conditional asset allocation strategy model, which can be presented by the following equation:

$$ w_t = \frac{E\left[r_t - r^{f}_{t-1} \mid F_{t-1}\right]}{\gamma \cdot \sigma^{2}\left[r_t \mid F_{t-1}\right]} \tag{10} $$

where $r_t$ represents the expected value-weighted market return in month t with machine learning, $r^{f}_{t-1}$ represents the risk-free yield in month t-1, and we define $r^{f}_{t-1}$ as the one-year Treasury bond yield. The parameter γ is assumed to be positive and reflects the risk aversion coefficient. We use 3 as the risk aversion coefficient in our asset allocation (Pinelis & Ruppert, 2022). $\sigma^{2}$ is the expected volatility of stock returns at time t. $w_t$ is the optimal weight we invest in risk assets.

3. Empirical results

3.1. Out-of-sample R2 and MSFE

We first examine the forecasting power of the TF model with different numbers of encoders, along with other machine learning models. To characterize the predictive power of the different models, we construct an out-of-sample R2. The out-of-sample R2 for returns is calculated as

$$ R^{2} = 1 - \frac{\sum_{i=1}^{N} \sum_{t=1}^{T} \left(r_{i,t} - \hat{r}_{i,t}\right)^{2}}{\sum_{i=1}^{N} \sum_{t=1}^{T} r_{i,t}^{2}}, \tag{11} $$
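As a worked illustration of Eq. (11), the sketch below computes the out-of-sample R2 from realized and predicted returns flattened over stocks and months, together with a mean squared forecast error. The function names and sample numbers are hypothetical, and the MSFE definition used here (a plain average of squared errors) is an assumption, since this section does not spell it out.

```python
from typing import Sequence


def oos_r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Eq. (11): 1 - sum((r - r_hat)^2) / sum(r^2), with an uncentered denominator."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum(a ** 2 for a in actual)  # benchmark is a zero forecast, not the sample mean
    if sst == 0:
        raise ValueError("realized returns are all zero; R2 is undefined")
    return 1.0 - sse / sst


def msfe(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared forecast error (assumed definition: average squared error)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical realized and predicted monthly returns.
realized = [0.02, -0.01, 0.03]
forecast = [0.01, 0.00, 0.02]
r2 = oos_r2(realized, forecast)
```

Because the denominator is not demeaned, a forecast of zero for every observation gives an R2 of exactly zero, and a model that does worse than that zero forecast gets a negative R2.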
² For details in our applications, we tabulate the hyperparameter settings of these models in the appendix (Table OA3).
³ We report the description and construction of each characteristic in the appendix (Tables OA1 and OA2).
Table 1
Out-of-sample R2 and MSFE for TF models.

Model        R2 (%)    MSFE (%)
Return
LR           -11.73    2.14
EN             0.71    2.05
PCA            0.80    2.01
LSTM           0.96    1.98
RF             1.00    2.00
TF(E = 1)      1.89    1.86
TF(E = 2)      1.49    1.88
TF(E = 3)      2.10    1.90
TF(E = 4)      2.09    1.90
TF(E = 5)      1.74    1.85
Volatility
LR            87.85    0.055
EN            92.21    0.027
PCA           91.74    0.033
LSTM          89.86    0.032
RF            88.29    0.030
TF(E = 1)     97.87    0.035
TF(E = 2)     98.27    0.029
TF(E = 3)     98.36    0.026
TF(E = 4)     98.29    0.027
TF(E = 5)     96.77    0.054

This table presents the R2 and MSFE of the OLS model and other machine learning models compared to the TF with different numbers of encoder layers (E), from one to five.

forecasting errors of the TF models are under 2.00% compared with the other models.

For volatility forecasting, the R2 of the models are relatively high, all above 87%, representing that they have excellent predictive skills, while TF models have the best prediction performance exceeding 95%, as well as the lowest mean square forecasting error with 3 encoders of 0.026%.

Wang, 2018; Engle, Ghysels, & Sohn, 2013). However, compared to the macroeconomic attention indices (MAI) constructed by Fisher et al. (2022), among eight categories, our results show that only Bd, GDP, and CPI have a significant impact on stock volatility as monetary policy, output growth, and inflation, respectively.

3.3. Asset allocation with the transformer model

This paragraph discusses the performance of out-of-sample investments calibrated by machine learning on a risk-adjusted basis and provides a relevant comparison. We invest one dollar as an investor in early 2010 while specifying a risk aversion factor of 3 and plot the cumulative returns to each strategy in Figs. 5 and 6 without short selling in 100% and 150% leverage constraints, respectively.

As shown in Fig. 6, the final December 2019 returns corresponding to the RF, EN, LR, LSTM, PCA, and TF with encoders from 1 to 5 are $4.39, $2.93, $2.47, $2.62, $2.52, $4.14, $4.68, $4.57, $4.90 and $4.53, respectively. Compared to the EN, LR, LSTM and PCA models, the RF and TF models captured the 2015 market expansion well, with the TF models whose encoders are 2, 3 and 4 ranking in the top three highest points of all models in 2015. At the same time, in the first two gray areas representing rapid declines, the downward trend of the LR, PCA, LSTM and EN models is relatively flat, while the downtrend of the RF and TF models is relatively steep, indicating that the volatility of LR, PCA, LSTM and EN is relatively small overall. In the third gray area, LR has the largest descent slope and the worst performance. As shown in Fig. 7, the RF and TF models exhibit a higher slope at each stage of earnings growth than the LR, EN, and PCA models, indicating that they are more accurate in predicting future trends.

Overall, the TF model outperforms the other models. The traditional machine learning models only perform well when the training and test samples are of the same type, and the self-attention mechanism in TF can capture the dynamic changes in data and generate more stable predictability.
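The allocation rule behind these strategies, a mean-variance weight with risk aversion 3, no short selling, and a 100% or 150% leverage cap, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `risky_weight`, the clipping order, and the sample forecasts are hypothetical, and the variance input stands in for the model's volatility forecast.

```python
def risky_weight(
    expected_return: float,
    risk_free: float,
    expected_variance: float,
    gamma: float = 3.0,
    leverage_cap: float = 1.0,
) -> float:
    """Weight on the risky asset: (E[r] - rf) / (gamma * variance), clipped to [0, cap]."""
    if gamma <= 0:
        raise ValueError("risk aversion gamma must be positive")
    if expected_variance <= 0:
        raise ValueError("expected variance must be positive")
    raw = (expected_return - risk_free) / (gamma * expected_variance)
    # No short selling (floor at 0) and a leverage limit (ceiling at the cap).
    return min(max(raw, 0.0), leverage_cap)


# Hypothetical monthly forecasts: 1.2% expected return, 0.2% risk-free, variance 0.004.
w_100 = risky_weight(0.012, 0.002, 0.004, leverage_cap=1.0)
# A stronger signal hits the 150% cap instead of the unconstrained 2.33.
w_150 = risky_weight(0.030, 0.002, 0.004, leverage_cap=1.5)
# A forecast below the risk-free rate is floored at zero rather than shorted.
w_neg = risky_weight(0.001, 0.002, 0.004)
```

Under this rule a higher variance forecast shrinks the position even when the return forecast is unchanged, which is why the return and volatility predictions both matter for the final weight.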
Table 3 contains the average annual realized utilities, certainty-equivalent (CE) yields computed as the inverse utility function of the average realized utility, and terminal wealth for different leverage coefficients.

The linear regression model exhibits the worst utility at 0.229 and 0.209 for the 100% leverage limit and the 150% leverage limit, respectively. The TF model strategy (E = 4) exhibits the best utility and highest CE yield of 0.314 and 5.19% when the leverage is 100%, with RF model strategy performance second, with a utility of 0.310 and CE yield of 5.19%. At a 150% leverage limit, the TF model (E = 4) also achieves the best utility and highest CE yield of 0.327 and 5.21%, with the TF model (E = 3) and TF model (E = 5) ranking second and third, respectively. This suggests that the model that gives investors the highest utility may also differ for different leverage limits. The RF model and the TF model perform similarly over a long period of time with no leverage restrictions, while the opposite is true with 150% leverage restrictions. However, the TF model ultimately shows a higher final return.

3.6. Transaction cost

Transaction cost is always considered in real investment. In this section, we investigate the asset allocation results after deducting a proportional transaction cost of 20, 40, 60 and 80 basis points. Table 4 reports the results with different leverage ratios and transaction costs.

We find that in terms of absolute transaction costs, the absolute transaction costs of a machine learning model with a leverage of 1 were smaller than the absolute transaction costs of a machine learning model with a leverage of 1.5. On the one hand, the initial principal of the machine learning model with a leverage ratio of 1 is relatively small, and on the other hand, the double leverage ratio limits the ordinary position adjustment of funds. In terms of relative transaction costs, the relative transaction cost of a machine learning model with a 150% leverage limit is smaller than that of a machine learning model with a 100% leverage limit, which shows that when we capture a high-yield opportunity, we can choose to bear the cost of position adjustment to obtain higher returns. In addition, the linear regression model has the largest change in relative transaction cost with basis point, indicating that the other
Fig. 7. Cumulative returns of reward-risk timing to linear regression (150% leverage limitation).
This figure plots the cumulative returns for the portfolio of each model from 2010 to 2019 with 150% leverage. We include linear regression (LR) as the benchmark. The initial investment is 1 dollar.
models have a smaller adjustment compared to the linear model. Among the TF models, those with an encoder equal to 3 have the smallest change in transaction cost with the basis point, and their investments are more stable.
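To make the effect of a proportional cost concrete, the following sketch compounds wealth while charging the cost on turnover. It assumes the cost applies to the absolute change in the risky-asset weight each month and that the remainder earns (or, when levered, pays) the risk-free rate; this section does not state the exact convention, so both assumptions and the name `net_terminal_wealth` are illustrative only.

```python
from typing import Sequence


def net_terminal_wealth(
    weights: Sequence[float],
    risky_returns: Sequence[float],
    risk_free_returns: Sequence[float],
    cost_bps: float,
    initial_wealth: float = 1.0,
) -> float:
    """Compound wealth net of a proportional cost charged on weight changes."""
    if not (len(weights) == len(risky_returns) == len(risk_free_returns)):
        raise ValueError("all input series must have the same length")
    cost_rate = cost_bps / 10_000.0  # basis points to a decimal rate
    wealth = initial_wealth
    previous_weight = 0.0  # assumed: the investor starts fully in the risk-free asset
    for w, r, rf in zip(weights, risky_returns, risk_free_returns):
        turnover = abs(w - previous_weight)
        gross = w * r + (1.0 - w) * rf  # w > 1 means borrowing at the risk-free rate
        wealth *= 1.0 + gross - cost_rate * turnover
        previous_weight = w
    return wealth


# Hypothetical two-month path: enter the market fully, then hold.
weights = [1.0, 1.0]
risky = [0.10, 0.00]
safe = [0.00, 0.00]
wealth_by_cost = {bps: net_terminal_wealth(weights, risky, safe, bps) for bps in (0, 20, 40, 60, 80)}
```

In this sketch a strategy that rarely changes its weight loses little wealth as the cost rises from 20 to 80 basis points, whereas a strategy with large month-to-month weight swings loses more.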
Table 2
Sharpe ratios and max downgrades.

Strategy     Annual Return (%)   Standard Deviation (%)   Sharpe Ratio   Max Downgrade (%)
Leverage = 1
LR            9.46               4.56                     1.36           19.3
EN           11.3                3.80                     2.13           20.2
PCA           9.68               3.70                     1.73           21.2
LSTM         10.15               3.62                     1.90           22.8
RF           15.9                5.09                     2.49           35.1
TF(E = 1)    15.2                5.06                     2.37           47.0
TF(E = 2)    16.7                5.06                     2.65           52.3
TF(E = 3)    16.4                5.12                     2.57           35.3
TF(E = 4)    17.2                5.07                     2.75           54.0
TF(E = 5)    16.3                5.46                     2.39           51.3
Leverage = 1.5
LR            8.97               5.09                     1.12           26.2
EN           11.6                3.83                     2.18           32.4
PCA           9.68               3.70                     1.73           29.6
LSTM         10.1                3.62                     1.91           29.6
RF           20.0                6.40                     2.62           67.3
TF(E = 1)    19.9                6.36                     2.62           64.9
TF(E = 2)    22.7                6.60                     2.96           71.2
TF(E = 3)    22.1                6.40                     2.95           70.3
TF(E = 4)    22.9                6.43                     3.06           71.9
TF(E = 5)    21.8                6.90                     2.70           69.6

This table shows the out-of-sample annual returns, standard deviations, Sharpe ratios and max downgrades for the test period from 2010 to 2019 for the trading rules.

Table 4
Transaction costs of portfolio allocation.

                 Terminal Wealth                   Sharpe ratio
Strategy     20 bps  40 bps  60 bps  80 bps    20 bps  40 bps  60 bps  80 bps
Leverage = 1
LR           2.34    2.21    2.10    1.99      1.23    1.10    0.97    0.85
EN           2.88    2.83    2.79    2.74      2.09    2.04    1.99    1.94
PCA          2.47    2.42    2.37    2.32      1.68    1.61    1.55    1.49
LSTM         2.62    2.60    2.60    2.58      1.90    1.88    1.88    1.86
RF           4.26    4.13    4.01    3.89      2.42    2.35    2.29    2.22
TF(E = 1)    3.99    3.84    3.69    3.54      2.29    2.20    2.11    2.02
TF(E = 2)    4.52    4.36    4.20    4.04      2.58    2.49    2.40    2.31
TF(E = 3)    4.45    4.33    4.22    4.11      2.51    2.45    2.39    2.33
TF(E = 4)    4.71    4.52    4.33    4.14      2.67    2.56    2.45    2.34
TF(E = 5)    4.39    4.25    4.11    3.97      2.33    2.25    2.17    2.09
Leverage = 1.5
LR           2.22    2.09    1.97    1.85      1.00    0.87    0.74    0.61
EN           2.96    2.91    2.86    2.81      2.14    2.09    2.04    1.99
PCA          2.47    2.42    2.37    2.32      1.68    1.61    1.55    1.49
LSTM         2.60    2.58    2.56    2.54      1.88    1.86    1.84    1.82
RF           6.01    5.79    5.59    5.39      2.56    2.49    2.43    2.36
TF(E = 1)    5.94    5.71    5.48    5.25      2.55    2.48    2.41    2.34
TF(E = 2)    7.51    7.23    6.95    6.67      2.89    2.82    2.75    2.68
TF(E = 3)    7.21    7.00    6.80    6.61      2.90    2.85    2.79    2.74
TF(E = 4)    7.60    7.29    6.98    6.67      3.00    2.93    2.86    2.79
TF(E = 5)    6.96    6.69    6.42    6.15      2.63    2.56    2.49    2.42

This table reports the impact of transaction costs on the monthly return (in %) and the annualized Sharpe ratio of the portfolio strategies based on different machine learning algorithms.

level samples to seek the economic ground and check the model’s robustness.
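The performance measures tabulated in Table 2 can be illustrated from a monthly return series. The sketch below is not the authors' procedure: it assumes that "max downgrade" corresponds to the usual maximum drawdown (largest peak-to-trough loss in cumulative wealth) and that the Sharpe ratio is annualized from monthly excess returns with a factor of sqrt(12). Function names and sample data are hypothetical.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(monthly_returns: Sequence[float], monthly_rf: Sequence[float]) -> float:
    """Mean monthly excess return over its sample standard deviation, scaled by sqrt(12)."""
    if len(monthly_returns) != len(monthly_rf) or len(monthly_returns) < 2:
        raise ValueError("need two or more observations of equal length")
    excess = [r - rf for r, rf in zip(monthly_returns, monthly_rf)]
    spread = statistics.stdev(excess)
    if spread == 0:
        raise ValueError("excess returns have zero variance")
    return statistics.fmean(excess) / spread * math.sqrt(12)


def max_drawdown(monthly_returns: Sequence[float]) -> float:
    """Largest fractional fall of cumulative wealth from its running peak."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in monthly_returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst


# Hypothetical monthly strategy returns and a zero risk-free rate.
sharpe = annualized_sharpe([0.02, 0.00, 0.04], [0.0, 0.0, 0.0])
drawdown = max_drawdown([0.10, -0.20, 0.05])
```

Reading the two measures together matters for Table 2, where the strategies with the highest Sharpe ratios also show the largest max downgrades.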
     TF(E = 5)   0.38   1.75*     1.42   0.41   1.82*     1.72
M    TF(E = 1)   0.94   3.39***   1.1    0.99   3.45***   1.45
     TF(E = 2)   0.90   3.33***   1.16   0.94   3.35***   1.66
     TF(E = 3)   0.97   3.48***   1.23   1.02   3.53***   1.75
     TF(E = 4)   0.94   3.34***   1.28   0.98   3.37***   2.11
     TF(E = 5)   0.94   3.38***   1.2    0.99   3.43***   1.68
L    TF(E = 1)   1.97   5.84***   0.71   1.99   5.66***   1.2
     TF(E = 2)   1.90   5.70***   0.67   1.92   5.51***   1.14
     TF(E = 3)   1.99   5.86***   0.78   2.01   5.69***   1.57
     TF(E = 4)   2.02   5.80***   0.79   2.04   5.63***   1.61
     TF(E = 5)   1.90   5.67***   0.63   1.91   5.47***   0.96

This table shows the excess returns and their t values for different portfolios in the FF3 (Fama & French, 1993) and FF5 (Fama & French, 2015) at the encoder level of the TF model from 1 to 5. The portfolios are divided by 3:4:3 according to the size of the market size to book-to-market ratio. S1 denotes the portfolio of small stocks with 1 encoder layer. Relative returns are a concrete embodiment of the content of Fig. 9. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.

the portfolios over the next month. Then, we analyze the realized return spread between the return portfolio and safety portfolio, denoted as WML.

$$ WML_{t+1} = \frac{1}{H_t} \sum_{i=1}^{N_t} \left(\hat{r}_{i,t} - \hat{r}_{m,t}\right) r_{i,t+1} \tag{18} $$

$$ H_t = \frac{1}{2} \sum_{i=1}^{N_t} \left|\hat{r}_{i,t} - \hat{r}_{m,t}\right| \tag{19} $$

The weight of each stock is proportional to the stock’s model-predicted return on a market-adjusted basis, with higher weights for higher return stocks in the long leg and more negative weights for lower return stocks in the short leg. The result is scaled by the inverse of the sum of absolute deviations of stock returns from the market average for standardization.

Next, following Hameed and Mian (2015) and Avramov et al. (2022), we decompose the return spread into two components. WML can be rewritten as:
$$
\begin{aligned}
WML_{t+1} &= \frac{1}{H_t} \sum_{i=1}^{N_t} \left(\hat{r}_{i,t} - \hat{r}_{ind,t} + \hat{r}_{ind,t} - \hat{r}_{m,t}\right) r_{i,t+1} \\
&= \frac{1}{H_t} \sum_{i=1}^{N_t} \left(\hat{r}_{i,t} - \hat{r}_{ind,t}\right) r_{i,t+1} + \frac{1}{H_t} \sum_{i=1}^{N_t} \left(\hat{r}_{ind\text{-}j,t} - \hat{r}_{m,t}\right) r_{i,t+1} \\
&= \frac{1}{H_t} \sum_{i=1}^{N_t} \left(\hat{r}_{i,t} - \hat{r}_{ind,t}\right) r_{i,t+1} + \frac{1}{H_t} \sum_{j=1}^{L_t} \left(\hat{r}_{ind\text{-}j,t} - \hat{r}_{m,t}\right) N_{j,t}\, r_{j,t+1}
\end{aligned}
\tag{20}
$$

where $\hat{r}_{ind\text{-}j,t}$ refers to the equal-weighted average of $\hat{r}_{i,t}$ across all stocks in industry j. That is, $\hat{r}_{j,t} = \frac{1}{N_{j,t}} \sum_{i=1}^{N_{j,t}} \hat{r}_{i,t}$, where $N_{j,t}$ refers to the number of stocks in industry j. $L_t$ refers to the number of industries, and $r_{j,t+1}$ refers to the equal-weighted average of stock returns across all stocks in industry j in month t + 1.

The first term in Eq. (20) represents the intraindustry return spread that takes a long position on stocks that are expected to confront higher returns than the industry average, i.e., $\hat{r}_{i,t} - \hat{r}_{j,t} > 0$, and takes a short position on stocks that are expected to retain lower returns than the industry average, i.e., $\hat{r}_{i,t} - \hat{r}_{j,t} < 0$. The second term represents the interindustry return spread that takes a long position on industries that are expected to confront higher returns than the market average, i.e., $\hat{r}_{j,t} - \hat{r}_{m,t} > 0$, and takes a short position on industries that are expected to retain lower returns than the market average, i.e., $\hat{r}_{j,t} - \hat{r}_{m,t} < 0$.

Following the same method, we also construct the risky minus safety portfolios (RMS) considering the volatility forecasting process, that is, stocks with higher risk than the market average, i.e., $\widehat{vol}_{i,t} - \widehat{vol}_{m,t} > 0$, defined as risk portfolio (R), and stocks that are expected to have lower crash risk than the market average, i.e., $\widehat{vol}_{i,t} - \widehat{vol}_{m,t} < 0$, defined as safety portfolio (S).

The results are tabulated in Table 6, and several findings are worth noting. First, all machine learning methods perform better in intraindustry samples than in interindustry samples. For instance, the return spread of the intraindustry WML portfolio accounts for 90% of the total risk spread in WML using the TF3 method (0.038 out of 0.042), which
Table 6
Transformer model attribution in intra- and interindustry.

                       LSTM       TF1        TF2        TF3        TF4        TF5
Panel A: Return
Winner                 0.078      0.065**    0.056**    0.070***   0.052***   0.063***
                       (1.37)     (2.45)     (2.46)     (2.94)     (2.55)     (2.60)
Loser                  0.030      0.030      0.024      0.028      0.024      0.027
                       (1.33)     (1.33)     (1.31)     (1.52)     (1.30)     (1.50)
WML                    0.029      0.037***   0.033***   0.042***   0.028***   0.036***
                       (1.00)     (3.06)     (3.83)     (3.88)     (3.64)     (2.92)
WML (intraindustry)    0.028      0.032***   0.030***   0.038***   0.026***   0.032***
                       (0.96)     (3.11)     (3.63)     (3.90)     (3.67)     (2.93)
WML (interindustry)    0.001      0.004      0.004      0.004      0.002      0.005
                       (0.02)     (0.39)     (0.43)     (0.38)     (0.30)     (0.42)
Panel B: Volatility
Risk                   0.169***   0.167***   0.168***   0.169***   0.173***   0.169***
                       (20.82)    (20.53)    (19.82)    (20.42)    (20.86)    (20.82)
Safety                 0.104***   0.104***   0.105***   0.107***   0.104***   0.104***
                       (23.79)    (23.79)    (27.24)    (26.69)    (26.75)    (28.27)
RMS                    0.065***   0.063***   0.063***   0.062***   0.069***   0.065***
                       (6.49)     (6.28)     (6.06)     (6.03)     (6.83)     (6.50)
RMS (intraindustry)    0.055***   0.053***   0.053**    0.052***   0.058***   0.055***
                       (6.18)     (6.21)     (6.01)     (5.93)     (6.61)     (6.18)
RMS (interindustry)    0.011      0.01       0.01       0.01       0.011      0.011
                       (1.23)     (1.18)     (1.10)     (1.11)     (1.29)     (1.23)

This table reports volatility and return in risk (loser) portfolios and safety (winner) portfolios, as well as return and volatility spreads in WML (RMS) portfolios and their decompositions in terms of intra- and interindustry. Panels A and B show the results in terms of the return and volatility measure, respectively. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.
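The intra- and interindustry split reported in Table 6 follows from the decomposition in Eq. (20), which can be sketched for a single month as below. The sketch assumes the market prediction is the equal-weighted average of the stock-level predictions; the function name, the industry labels, and the sample numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def wml_decomposition(
    predicted: Sequence[float],
    realized_next: Sequence[float],
    industry: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (total, intraindustry, interindustry) WML for one month, per Eqs. (18)-(20)."""
    n = len(predicted)
    if n == 0 or n != len(realized_next) or n != len(industry):
        raise ValueError("inputs must be non-empty and of equal length")
    market_pred = sum(predicted) / n  # assumed: equal-weighted market prediction
    h = 0.5 * sum(abs(p - market_pred) for p in predicted)  # Eq. (19)
    if h == 0:
        raise ValueError("all predictions are identical; the spread is undefined")
    # Equal-weighted average prediction within each industry.
    members: Dict[str, List[int]] = {}
    for idx, name in enumerate(industry):
        members.setdefault(name, []).append(idx)
    ind_pred = {name: sum(predicted[i] for i in idxs) / len(idxs) for name, idxs in members.items()}
    total = sum((predicted[i] - market_pred) * realized_next[i] for i in range(n)) / h
    intra = sum((predicted[i] - ind_pred[industry[i]]) * realized_next[i] for i in range(n)) / h
    inter = sum((ind_pred[industry[i]] - market_pred) * realized_next[i] for i in range(n)) / h
    return total, intra, inter


# Hypothetical month with four stocks in two industries.
total, intra, inter = wml_decomposition(
    predicted=[0.03, 0.01, 0.00, -0.02],
    realized_next=[0.05, 0.01, -0.01, -0.03],
    industry=["A", "A", "B", "B"],
)
```

By construction the two components add up to the total spread, so the share attributed to stock selection within industries can be read off as `intra / total`.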
illustrates that our method pays more attention to selecting stocks than industry rotation. Second, the TF model has better performance than LSTM in WML portfolios. While the value of RMS is significant at the 1% level within all methods, the value of WML is significant only in the TF method.

5. Conclusion

This paper extends the application of the Transformer (TF) model by Vaswani et al. (2017) in the Chinese stock market. We compare the performance of the TF model and other machine learning models in the stock market, delve deeper into the factors that influence stock forecasts and compare the performance of different subsamples.

Our findings suggest that a significant portion of Chinese stock returns can be explained by the TF pricing model. In terms of return and volatility prediction, TF models show higher out-of-sample R2 than traditional machine learning models (up to 2.10% for returns and 98.36% for volatility). The TF model performs significantly better in asset allocation than other traditional machine learning models over the out-of-sample and subsamples. The TF model performs best on indicators such as the Sharpe ratio and investor utility. After considering the transaction costs, the TF model can also obtain sufficient benefits, indicating that the TF model can be effectively applied in practice. In addition, in the comparison of subsamples, the TF model shows a predicted return closer to the real return, which shows its strong predictive ability.

We also provide economic explanations for deep learning with the SHAP value. In return prediction, fundamental characteristics play a more important role, while macroeconomic features show more importance in volatility forecasting. The TF model is more inclined to predict return and volatility within industries than between industries.

Overall, our paper sheds light on the surging literature on asset allocation in the big data era and has great implications for the efficiency of financial markets in emerging markets. Meanwhile, our exploration of the TF model is still in the beginning stages, and the decoder mechanism can be further improved and broadly adopted in financial studies. For example, Woo, Liu, Sahoo, Kumar, and Hoi (2022) show that the TF variant ETSformer model has a good effect on the smooth analysis of time series.

Acknowledgements

The authors are grateful for the very constructive comments and suggestions from the editor, anonymous reviewers, Chunmin Zhang, Benjian Wu, Xuejun Zhang, Fuwei Jiang, Zhanyu Ying, and the seminar participants at Minzu University of China.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.irfa.2023.102876.

References

Abbas, G., McMillan, D. G., & Wang, S. (2018). Conditional volatility nexus between stock markets and macroeconomic variables empirical evidence of G-7 countries. Journal of Economic Studies, 45(1), 77–99.
Avramov, D., Cheng, S., & Metzker, L. (2022). Machine learning vs. economic restrictions: Evidence from stock return predictability. Management Science, 0(0).
Bali, T. G., Goyal, A., Huang, D., Jiang, F., & Wen, Q. (2022). Predicting corporate bond returns: Merton meets machine learning. In Georgetown McDonough School of Business Research Paper (3686164) (pp. 20–110).
Chen, L., Pelger, M., & Zhu, J. (2021). Deep learning in asset pricing. Available from: https://doi.org/10.2139/ssrn.3350138.
Engle, R. F., Ghysels, E., & Sohn, B. (2013). Stock market volatility and macroeconomic fundamentals. Review of Economics and Statistics, 95(3), 776–797.
Fabian, H., & Marcel, P. (2022). Managing the market portfolio. Management Science, 0(0).
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3–56.
Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116, 1–22.
Fan, J., Ke, Z. T., Liao, Y., & Neuhierl, A. (2022). Structural deep learning in conditional asset pricing. Available from: SSRN 4117882.
Ferson, W. E., Siegel, A. F., & Wang, J. L. (2019). Asymptotic variances for tests of portfolio efficiency and factor model comparisons with conditioning information. University of Washington Working Paper.
Fisher, A., Martineau, C., & Sheng, J. (2022). Macroeconomic attention and announcement risk premia. Review of Financial Studies, 35(11), 5057–5093.
Giglio, S., Kelly, B., & Xiu, D. (2022). Factor models, machine learning, and asset pricing. Annual Review of Financial Economics, 14, 337–368.
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273.
Hameed, A., & Mian, G. M. (2015). Industries and stock return reversals. Journal of Financial and Quantitative Analysis, 50, 89–117.
Hanauer, M. X., Kononova, M., & Rapp, M. S. (2022). Boosting agnostic fundamental analysis: Using machine learning to identify mispricing in European stock markets. Finance Research Letters, 48, Article 102856.
Jiang, F., Tang, G., & Zhou, G. (2018). Firm characteristics and Chinese stocks. Journal of Management Science and Engineering, 3(4), 259–283.
Kandel, S., & Stambaugh, R. F. (1996). On the predictability of stock returns: An asset-allocation perspective. The Journal of Finance, 51, 385–424.
Kapetanios, G., & Kempf, F. (2022). Interpretable machine learning modeling for asset pricing (Working paper).
Kazemi, S. M., Goel, R., Eghbali, S., Ramanan, J., Sahota, J., Thakur, S., et al. (2019). Time2Vec: Learning a vector representation of time. Available from: 10.48550/arXiv.1907.05321.
Leippold, M., Wang, Q., & Zhou, W. (2022). Machine learning in the Chinese stock market. Journal of Financial Economics, 145(2), 64–82.
Lin, T., Wang, Y., Liu, X., et al. (2021). A survey of transformers. Available from: https://www.arxiv.org/abs/2106.04554v2.
Liu, J., Stambaugh, R. F., & Yuan, Y. (2019). Size and value in China. Journal of Financial Economics, 134, 48–69.
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Neural Information Processing Systems, 30.
Ma, T., Leong, W. J., & Jiang, F. (2023). A latent factor model for the Chinese stock market. International Review of Financial Analysis, 87, 102555.
Ma, T., Liao, C., & Jiang, F. (2023). Timing the factor zoo via deep learning: Evidence from China. Accounting & Finance, 63, 485–505.
Marquering, W., & Verbeek, M. (2004). The economic value of predicting stock index returns and volatility. Journal of Financial and Quantitative Analysis, 39(2), 407–429.
Moreira, A., & Muir, T. (2017). Volatility-managed portfolios. The Journal of Finance, 69(2), 1611–1644.
Pinelis, M., & Ruppert, D. (2022). Machine learning portfolio allocation. The Journal of Finance and Data Science, 8, 35–54.
Seavey, S. E., Imhof, M., & Watanabe, O. V. (2016). Proprietary costs of competition and financial statement comparability. University of Nebraska-Lincoln Working Paper.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Available from: 10.48550/arXiv.1706.03762.
Woo, G., Liu, C., Sahoo, D., Kumar, A., & Hoi, S. (2022). ETSformer: Exponential smoothing transformers for time series forecasting. Available from: 10.48550/arXiv.2202.01381.