
International Review of Financial Analysis 90 (2023) 102876


Attention is all you need: An interpretable transformer-based asset allocation approach

Tian Ma a,b,*, Wanwan Wang a, Yu Chen a
a School of Economics, Minzu University of China, Beijing, China
b China Institute for Vitalizing Border Areas and Enriching the People, Beijing, China

A R T I C L E  I N F O

Keywords:
Transformer model
Asset allocation
SHAP
Chinese stock market

A B S T R A C T

Deep learning technology is being rapidly adopted in financial market settings. Using a large data set from the Chinese stock market, we propose a return-risk trade-off strategy via a new transformer model. The empirical findings show that these updates, such as the self-attention mechanism, can improve the use of time-series information related to returns and volatility, increase predictability, and capture more economic gains than other nonlinear models, such as LSTM. Our model employs Shapley additive explanations (SHAP) to measure "economic feature importance" and tabulates the different important features in the prediction process. Finally, we document several economic explanations for the TF model. This paper sheds light on the burgeoning field of asset allocation in the age of big data.

1. Introduction

In recent years, new predictive deep learning techniques have been rapidly updated and adopted in a wide range of industries to improve profitability. These applications raise two concerns: (1) whether a successful model in one domain is suitable for another complex environment, such as the financial market, and (2) why the new technique performs better than traditional models, and how this is achieved. In this paper, we study asset allocation with newly developed deep learning techniques in financial markets. We do so by developing an interpretable transformer framework, conducting empirical analysis on a large data set from the Chinese stock market, and examining the economic ground of deep learning in portfolio selection.

There is a large body of literature on asset allocation with expected return and volatility, such as optimal stock-to-cash allocation (Kandel & Stambaugh, 1996), stock index allocation (Marquering & Verbeek, 2004) and portfolio efficiency (Ferson, Siegel, & Wang, 2019; Moreira & Muir, 2017). However, those studies implicitly assume linear relations in the prediction process and lack suitable frameworks for a large data set. To address these shortcomings, this paper follows Pinelis and Ruppert (2022), who first applied machine learning to the allocation between indices and risk-free assets, and introduces an interpretable transformer model (TF) as a new asset pricing framework. The TF model is a novel, powerful deep learning method developed by Vaswani et al. (2017), and it is a recently proposed model for dealing with nonlinear data.

Notably, the self-attention mechanism improves the performance of the TF. It can observe all the fundamental information of a stock at the same time and then filter the key information, just as human eyes can quickly scan an image and locate the areas that need attention. Therefore, the TF model is more likely to select the important and critical information from fundamental and macro signals, and its final forecasts are closer to the real return or volatility distributions than those of traditional neural network models.

For better model training, the data sample in our paper combines 72 firm characteristics with 8 macro features following Jiang, Tang, and Zhou (2018) and Fisher, Martineau, and Sheng (2022), covering all Chinese A-share stocks from January 2000 to December 2019, and excludes the bottom 20% of stocks in terms of firm size at the beginning of each sample year to minimize the small-firm effect and shell-value contamination (Liu, Stambaugh, & Yuan, 2019). The average monthly number of firms is 1824, and their average monthly return is 1.17%.

Several empirical findings are worth noting. First, in terms of predictive power, the TF model performs significantly better than traditional models in predicting stock return and volatility. The best traditional machine learning model has the highest R^2 in return forecasting at 1%; in the TF model, the best-performing R^2 is 2.1%. In terms of volatility prediction, the best-performing traditional machine learning model

* Corresponding author at: School of Economics, Minzu University of China, Beijing, China.
E-mail address: mark8938@qq.com (T. Ma).

https://doi.org/10.1016/j.irfa.2023.102876
Received 2 February 2023; Accepted 1 August 2023
Available online 3 August 2023
1057-5219/© 2023 Elsevier Inc. All rights reserved.

is the elastic network with R^2 = 92.21%, while the TF model (encoder = 1–5) predicts an R^2 between 96.77% and 98.36%. The larger improvement in volatility forecasting highlights the advantage of the TF model in time-series autocorrelation prediction. In addition, the largest Sharpe ratio using the TF model is 2.75, which is also higher than that of the other machine learning models, demonstrating the success of the TF model in this emerging market.

Second, focusing on model interpretation, we apply the Shapley additive explanations (SHAP) value (Lundberg & Lee, 2017) to calculate the importance of each feature in the model. The SHAP value method is a feature attribution method that infers how a set of features relates to the prediction process, which is helpful for interpreting model performance. In terms of SHAP values in return prediction, fundamental signals such as earnings yield and gross margins are extremely important, indicating that firm quality has a great impact on stock returns. In contrast, the important features in stock volatility forecasting are more associated with the macro environment, such as bond yields, net exports and GDP. The difference in feature importance highlights the necessity of forecasting return and volatility separately (Fabian & Marcel, 2022).

Finally, we conduct several analyses to exploit the economic ground of the TF model. First, in additional portfolio-level analysis using fundamental and macro signals with a self-attention mechanism, the TF model shows better performance on large and value firms. Second, in terms of prediction error, we find that the prediction error of the TF model is insignificant (of the 72 portfolio samples, only 4 were significant), while the error of traditional models such as LSTM is largely significant (of the 72 portfolio samples, 34 were significant), with the error close to the level of the characteristic-managed portfolio return. This suggests that the expected returns predicted by the TF model are closer to the true returns of portfolios and stocks than those of traditional deep learning models, which may be because the unique self-attention mechanism can better detect the time-series relations in market anomalies. Finally, the TF model outperforms within industries, indicating that our model focuses more on stock selection than on industry rotation.

Our study makes several contributions. First, this paper extends the use of machine learning with big data for asset allocation and asset pricing (Chen, Pelger, & Zhu, 2021; Giglio, Kelly, & Xiu, 2022; Gu, Kelly, & Xiu, 2020; Leippold, Wang, & Zhou, 2022; Pinelis & Ruppert, 2022; Ma, Leong, & Jiang, 2023). As a prime example, Pinelis and Ruppert (2022) introduce a utility-maximizing investment strategy using random forests to predict return and volatility. Our study takes the lead in introducing the TF model as the pricing model. Compared with traditional machine learning models, the TF model is characterized by self-attention, which increases its prediction performance. Second, we introduce an interpretability mechanism to the deep learning framework and show the economic grounds of the machine learning-based pricing model. Recent studies highlight interpretable machine learning models in the financial market (e.g., Fan, Ke, Liao, & Neuhierl, 2022; Kapetanios & Kempf, 2022). We extend this research by using SHAP values to visualize feature contributions and find different feature importance in return and volatility prediction. Finally, our paper echoes asset pricing studies and extends the economic explanations of machine learning in the Chinese stock market (Leippold et al., 2022; Liu et al., 2019; Ma, Liao, & Jiang, 2023).

The rest of this paper is organized as follows. Section 2 describes the data and method, Section 3 presents the empirical findings, Section 4 delineates the economic ground, and Section 5 summarizes the full paper.

2. Methodology

2.1. Transformer model algorithms¹

¹ We tabulate the hyperparameter settings of our models in the appendix in Table OA3.

The transformer model (TF), a type of unsupervised deep learning, has been widely used in various areas of artificial intelligence in recent years. A typical TF consists of two modules, an encoder and a decoder. Many variants have subsequently been derived, such as Realformer, Performer, Lazyformer, and Bidirectional Encoder Representations from Transformers (BERT) (Lin, Wang, Liu, et al., 2021). Following Vaswani et al. (2017), we construct the TF model for financial asset allocation by modifying the model according to its encoder part. As shown in Fig. 1, our model includes an input embedding layer, three encoders, and a dense layer. The equations of the model are as follows:

r_t = f_t(z_{i,t−1}; θ) + ε_{i,t},  (1)

σ_t = f_t(z_{i,t−1}; θ) + ϵ_{i,t},  (2)

where z_{i,t−1} is the vector of firm and macro factors and f_t represents the TF deep learning model, which takes into account the nonlinear relationship between variables. (See Figs. 2–4.)

Fig. 1. TF model.
This figure shows the internal structure of our constructed TF model, which consists of 1 embedding layer, several encoder and hidden layers, and 1 output layer. The input data are the n features of P stocks.

2.1.1. Input embedding layer

Following Kazemi et al. (2019), we use Time2Vec, a representation of time with three identified properties, to embed the stock features into a higher D-dimensional space. For a given scalar notion of time t, the Time2Vec of t, denoted t2v(t), is a vector of size k + 1 defined as follows:

t2v(t)[i] = ω_i t + φ_i,        if i = 0
t2v(t)[i] = sin(ω_i t + φ_i),   if 1 ≤ i ≤ k  (3)

where t2v(t)[i] is the ith element of t2v(t), sin is the sine periodic activation function, and the ω_i's and φ_i's are learnable parameters.
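The Time2Vec embedding of Eq. (3) is straightforward to express in code. Below is a minimal NumPy sketch, not the authors' implementation; the parameter values are illustrative stand-ins for what would be learned during training.

```python
import numpy as np

def time2vec(t, omega, phi):
    """Time2Vec embedding of a scalar time t (Eq. (3)).

    omega, phi: parameter vectors of length k + 1 (learnable in practice).
    Index 0 is the linear, non-periodic term; indices 1..k pass through
    the sine periodic activation.
    """
    out = omega * t + phi          # affine transform for all k + 1 indices
    out[1:] = np.sin(out[1:])      # periodic activation for i = 1..k
    return out

# Toy check with k = 2 and fixed (not learned) parameters.
omega = np.array([1.0, 2.0, 0.5])
phi = np.array([0.0, 0.0, np.pi / 2])
v = time2vec(3.0, omega, phi)
# v[0] = 3.0 is the linear term; v[1], v[2] are periodic features
```

In a full model, this (k + 1)-vector would be concatenated with or applied across the stock feature sequence before the encoder stack.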


The linear term represents the progression of time and can be used to capture nonperiodic patterns in the input that depend on time.

2.1.2. Encoder layer

Each encoder layer contains three building blocks: (1) a multihead attention module, (2) two convolutions with kernel size 1, and (3) residual connections after each of the previous blocks.

Fig. 2. Encoder structure.
This figure shows the encoder structure, which contains three building blocks: a multiheaded attention module, two convolutions with kernel size 1, and residual connections after each module.

i. Multihead Attention Mechanism

The essence of multihead attention is multiple independent attention calculations, which serve as an integrated function to prevent overfitting. Multiple queries are used to calculate multiple pieces of information from the input in parallel, and each attention head focuses on a different part of the input. Within each attention module, an entry of a sequence, named the "query" (Q), is compared to all other sequence entries, named the "keys" (K), by a dot product scaled by the (equal) query and key embedding dimensionality d_k. The output is then used to weight the same sequence entries, named the "values" (V). Q, K, and V are first subjected to a linear transformation and then input to the scaled dot-product attention. This happens h times; this is the so-called multihead, and each computation counts as one head. Moreover, the parameters W for each linear transformation of Q, K, and V are different. Then, the scaled dot-product attention results of the h heads are stitched together, and the values obtained by a final linear transformation are used as the result of the multiheaded attention. The difference in the multiheaded attention proposed by Google is that h calculations are performed, not just one. Overall, the superiority of multihead attention is that it broadens the model's ability to attend to information at different positions and provides multiple "representational subspaces" at the attention layer. The output matrix of a single attention head is

Attention(Q, K, V) = softmax(QK^T / √d_k) V.  (4)

Fig. 3. Scaled dot-product attention.
This figure illustrates the computational process of scaled dot-product attention.

The multihead attention function is as follows:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O  (5)

where

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (6)

and W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, and W^O ∈ R^{hd_v×d_model} are parameter matrices. In this work, we employ h = 3 parallel attention layers, or heads. For each of these, we use d_v = d_k = 64 and d_model = 64.

Fig. 4. Multihead attention.
This figure illustrates the computation of the multiheaded attention mechanism. First, h different sets of linear projections are obtained to transform the queries, keys and values. Then, these h sets of transformed queries, keys, and values are pooled in parallel. Finally, the outputs of these h attention pooling layers are stitched together and transformed by another learnable linear projection to produce the final output.

ii & iii. Addition and Normalization & Two Convolutions

The multihead attention output passes through the add and norm layers to standardize the data and constrain the ambiguity caused by
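Eqs. (4)–(6) can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the authors' code, and the toy dimensions are scaled down from the paper's h = 3, d_k = d_v = d_model = 64.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (4)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(X, Wq, Wk, Wv, Wo):
    """Multihead attention, Eqs. (5)-(6): h independent heads on the same
    input sequence X, concatenated and mixed by the output projection Wo."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Toy dimensions: n tokens, h heads (random weights stand in for learned ones).
rng = np.random.default_rng(0)
n, d_model, d_k, h = 5, 8, 4, 3
X = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(h, d_model, d_k))
Wk = rng.normal(size=(h, d_model, d_k))
Wv = rng.normal(size=(h, d_model, d_k))
Wo = rng.normal(size=(h * d_k, d_model))
Y = multihead(X, Wq, Wk, Wv, Wo)   # shape (n, d_model)
```

Each head attends over all n sequence entries at once, which is the mechanism the paper credits for preserving long-range time-series information.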


each vector of z. Moreover, considering that the multihead attention mechanism may not fit complex processes well enough, the encoder is enhanced by adding two layers of convolutions with kernel size 1.

2.1.3. Output layer

First, the features passing through the encoder layers are subjected to global average pooling (GAP), which flattens the remaining dimensions to reshape the tensor. These features are then sent to two fully connected layers to output the predicted results. The first dense layer is as follows:

Z_k^l = g(b^{l−1} + Z^{l−1} W^{l−1}),  (7)

where g(·) is the nonlinear "activation function" that takes the aggregated signal from the previous layer and sends it to the next layer. We apply the rectified linear unit (ReLU) as the nonlinear activation function:

ReLU(Z_k) = max(Z_k, 0).  (8)

The final output is a linear transformation of the last dense layer's output:

D(Z, b, W) = b^{l−1} + Z^{l−1} W^{l−1}.  (9)

We apply the first dense layer with 64 neurons. Moreover, to avoid overfitting, we add one dropout layer between the two dense layers, disabling a portion of the neurons with a 0.1 dropout rate.

The gradient descent algorithm is generally used to optimize the objective function. For a given function L(θ), the algorithm minimizes L(θ) by updating θ along the first-order derivative of the function, i.e., in the direction opposite to the gradient: θ = θ − η∇_θ L(θ), where η is the iteration step size. We use the stochastic gradient descent (SGD) method to optimize the TF model.

In our main empirical analysis, we use the TF model to predict two things: the first is the value-weighted market portfolio's return based on firm-specific characteristics and macro factors, and the second is the portfolio's volatility, predicted with the same method. To pursue the economic grounds of the TF model and check its robustness, we also consider predictions on characteristic-managed portfolios and industry portfolios in Section 4.

For comparison, we include principal component analysis (PCA), elastic net (Enet), random forest (RF) and long short-term memory with three layers (LSTM), with their algorithms displayed in Bali, Goyal, Huang, Jiang, and Wen (2022).² In particular, LSTM has a unit similar to the attention unit of the TF model, named the memory unit, which makes it a good comparison for our new mechanism.

² For details of our applications, we tabulate the hyperparameter settings of these models in the appendix in Table OA3.

We follow the standard approach in the literature for hyperparameter selection and model estimation. We divide our 20-year sample (i.e., 2000–2019) into three disjoint time periods that maintain the temporal ordering of the data: 7 years of training sample (2000–2006), 3 years of validation sample (2007–2009), and the remaining 10 years (2010–2019) for out-of-sample testing. We refit our models at the beginning of every year, each time increasing the training sample by one year. We maintain the same size of the validation sample but roll it forward to include the most recent twelve months of data.

2.2. Data

This paper selects all stock return data in the China A-share market from January 2000 to December 2019; the data source is the CSMAR Database. For the large database of characteristic factors, this paper constructs six categories based on Jiang et al. (2018) with 68 firm characteristic indicators: valuation and growth, investment, earnings, inertia, trading friction and intangible assets.³ Additionally, we calculate lagged returns and volatilities at 1 and 2 months, considering the autocorrelation in these variables (Pinelis & Ruppert, 2022).

³ We report the description and construction of each characteristic in the appendix in Tables OA1 and OA2.

To investigate whether macro factors affect asset allocation, we select an additional 8 macro factors to add to our data set, resulting in a total of 80 variables. For the market portfolio used as the risky asset in our paper, we weight each stock by its market size, thus constructing a portfolio whose β is 1, meaning that its systematic risk equals the systematic risk of the market.

2.3. Asset allocation

When we allocate between risky and risk-free assets, we use a conditional asset allocation strategy, which can be presented by the following equation:

w_t = E[r_t − r_{t−1}^f | F_{t−1}] / (γ · σ²[r_t | F_{t−1}])  (10)

where r_t represents the expected value-weighted market return in month t from machine learning, and r_{t−1}^f represents the risk-free yield in month t−1, which we define as the one-year Treasury bond yield. The parameter γ is assumed to be positive and reflects risk aversion; we use 3 as the risk aversion coefficient in our asset allocation (Pinelis & Ruppert, 2022). σ² is the expected volatility of stock returns at time t, and w_t is the optimal weight invested in the risky asset.

3. Empirical results

3.1. Out-of-sample R² and MSFE

We first examine the forecasting power of the TF model with different numbers of encoders, along with the other machine learning models. To characterize the predictive power of the different models, we construct an out-of-sample R². The out-of-sample R² for returns is calculated as

R² = 1 − [ Σ_{i=1}^N Σ_{t=1}^T (r_{i,t} − r̂_{i,t})² ] / [ Σ_{i=1}^N Σ_{t=1}^T r_{i,t}² ],  (11)

where r_{i,t} and r̂_{i,t} are the actual and forecast return values of the stock or portfolio in each period, respectively. The range of R² is (−∞, 1]. Higher values indicate better model prediction ability, with R² equal to 1 when the model completely predicts stock returns for each period. Meanwhile, we report the mean square forecasting error (MSFE) of each model:

MSFE = (1/T) Σ_{t=1}^T (1/N_t) Σ_{i=1}^{N_t} (r_{i,t} − r̂_{i,t})²  (12)

The R² and MSFE for the volatility models are computed in the same way.

Table 1 reports the results. For the R² of return forecasting, ordinary least squares regression (LR) performs the worst at −11.73%, while PCA performs best among the linear models at 0.80%. The TF, whose self-attention mechanism can accept all vectors as input at the same time so that long-range information is not weakened, outperforms the random forest model for every number of encoders, with three encoders yielding the highest R² of 2.10%. All the mean square forecasting errors of the TF models are under 2.00%, compared with the other models.
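Eqs. (11) and (12) can be sketched as follows. This is a simple illustration assuming a fixed N × T panel; in the paper's data the monthly cross-section N_t varies.

```python
import numpy as np

def oos_r2(r, r_hat):
    """Out-of-sample R^2 of Eq. (11): one minus the sum of squared forecast
    errors over the sum of squared actual returns (a zero forecast scores 0).
    r, r_hat: arrays of shape (N, T) of actual and forecast returns."""
    return 1.0 - np.sum((r - r_hat) ** 2) / np.sum(r ** 2)

def msfe(r, r_hat):
    """Mean square forecasting error of Eq. (12): the cross-sectional mean
    squared error of each period, averaged over the T periods."""
    return np.mean(np.mean((r - r_hat) ** 2, axis=0))

# Tiny 2-stock, 2-month panel.
r = np.array([[0.02, -0.01],
              [0.03,  0.01]])
perfect = oos_r2(r, r)                  # 1.0 when forecasts match exactly
zero_fc = oos_r2(r, np.zeros_like(r))   # 0.0 for an all-zero forecast
```

The zero-forecast benchmark in the denominator is why negative R² values (such as LR's −11.73%) are possible out of sample.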


Table 1
Out-of-sample R² and MSFE for TF models.

Model       R² (%)    MSFE (%)
Return
LR          −11.73    2.14
EN          0.71      2.05
PCA         0.80      2.01
LSTM        0.96      1.98
RF          1.00      2.00
TF(E = 1)   1.89      1.86
TF(E = 2)   1.49      1.88
TF(E = 3)   2.10      1.90
TF(E = 4)   2.09      1.90
TF(E = 5)   1.74      1.85
Volatility
LR          87.85     0.055
EN          92.21     0.027
PCA         91.74     0.033
LSTM        89.86     0.032
RF          88.29     0.030
TF(E = 1)   97.87     0.035
TF(E = 2)   98.27     0.029
TF(E = 3)   98.36     0.026
TF(E = 4)   98.29     0.027
TF(E = 5)   96.77     0.054

This table presents the R² and MSFE of the OLS model and other machine learning models compared to the TF with different numbers of encoder layers (E), from one to five.

For volatility forecasting, the R² values of the models are relatively high, all above 87%, indicating excellent predictive skill, while the TF models have the best prediction performance, exceeding 95%, as well as the lowest mean square forecasting error, 0.026% with 3 encoders.

3.2. Feature economic importance

To explain which characteristics are important and to what extent they affect the model, we apply SHAP values, proposed by Lundberg and Lee (2017) based on a unification of ideas from game theory. SHAP attributes a value μ_i to each feature i, formulated as follows:

μ_i = Σ_{S ⊆ F∖{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] [ f_{S∪{i}}(S ∪ {i}) − f_S(S) ],  (13)

where F is the set of all features, S ranges over all possible subsets of F∖{i}, f_{S∪{i}} represents a model trained with feature i, and f_S is a model trained without feature i. Predictions from the two models are then compared on the current input, f_{S∪{i}}(S ∪ {i}) − f_S(S). The larger the value of μ_i, the more important the firm or macro characteristic.

In Fig. 5, we show each factor's average contribution to the return predictions and volatility predictions. The vertical axis sorts the features by the sum of the SHAP values of all samples, and the horizontal axis is the SHAP value. Each point represents a sample, samples are stacked vertically, and the colors indicate the feature values (red corresponds to high values, blue to low values).

Gross margins, earnings yield and the debt-to-equity ratio have the most significant effect on the return prediction process, which is consistent with the findings of Leippold et al. (2022). Hanauer, Kononova, and Rapp (2022) obtained similar results after analyzing the determinants of the European stock market using SHAP. GDP, CPI and bond yield play an important role in volatility forecasting as macro features. The impact of increased GDP output, especially industrial output, on stock volatility is well documented in the literature (Abbas, McMillan, & Wang, 2018; Engle, Ghysels, & Sohn, 2013). However, compared to the macroeconomic attention indices (MAI) constructed by Fisher et al. (2022), among eight categories, our results show that only bond yield, GDP, and CPI have a significant impact on stock volatility, as monetary policy, output growth, and inflation, respectively.

3.3. Asset allocation with the transformer model

This section discusses the performance of the out-of-sample investments calibrated by machine learning on a risk-adjusted basis and provides a relevant comparison. We invest one dollar as an investor in early 2010, specify a risk aversion factor of 3, and plot the cumulative returns of each strategy in Figs. 6 and 7, without short selling, under 100% and 150% leverage constraints, respectively.

As shown in Fig. 6, the final December 2019 wealth corresponding to the RF, EN, LR, LSTM, PCA, and TF with encoders from 1 to 5 is $4.39, $2.93, $2.47, $2.62, $2.52, $4.14, $4.68, $4.57, $4.90 and $4.53, respectively. Compared to the EN, LR, LSTM and PCA models, the RF and TF models captured the 2015 market expansion well, with the TF models with 2, 3 and 4 encoders ranking as the top three highest points of all models in 2015. At the same time, in the first two gray areas, which represent rapid declines, the downward trends of the LR, PCA, LSTM and EN models are relatively flat, while those of the RF and TF models are relatively steep, indicating that the volatility of LR, PCA, LSTM and EN is relatively small overall. In the third gray area, LR has the largest descent slope and the worst performance. As shown in Fig. 7, the RF and TF models exhibit a higher slope at each stage of earnings growth than the LR, EN, and PCA models, indicating that they are more accurate in predicting future trends.

Overall, the TF model outperforms the other models. Traditional machine learning models only perform well when the training and test samples are of the same type, whereas the self-attention mechanism in the TF can capture dynamic changes in the data and generate more stable predictability.

3.4. The Sharpe ratio and the max downgrade

Table 2 reports the Sharpe ratio and max downgrade of each model. Machine learning outperforms the linear regression model on a risk-adjusted basis over the out-of-sample period. Reward-risk timing with TF (encoder = 4), using the TF for both conditional return and volatility estimates, gives the highest Sharpe ratios, 2.75 and 3.06 under 100% and 150% leverage, respectively, which are 139% and 194% increases over linear regression. The Sharpe ratios of the TF models with 1 to 5 encoders are higher than those of all models other than RF, suggesting that the TF model can outperform most machine learning models even when it is not trained to its best.

Considering the maximum downside of these strategies, when volatility is high, the TF strategy takes on relatively more risk. The maximum retracement of the TF model is greater than that of the other models, which confirms, under both leverage constraints, that taking higher risk is correlated with obtaining higher returns.

3.5. Investment utility

While investors usually take risk and return into account when investing, for long-term investments we also consider the level of investor satisfaction, that is, utility, at each time. If two investments provide the same ultimate return to the investor, one providing a fixed return in each period and the other providing only a single return in the final period, then the former will clearly provide more utility to the investor than the latter. To calculate the utility of month t, we take the following equation (Pinelis & Ruppert, 2022):

U(W_t) = (W_t^{1−γ} − 1) / (1 − γ).  (14)
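Eq. (14) and the certainty-equivalent inversion used for Table 3 can be sketched as follows. The CE inversion follows the paper's description of "the inverse utility function of the average realized utility"; the exact wealth normalization is our assumption.

```python
def crra_utility(W, gamma=3.0):
    """CRRA utility of wealth W, Eq. (14), with the paper's gamma = 3."""
    return (W ** (1.0 - gamma) - 1.0) / (1.0 - gamma)

def certainty_equivalent(avg_utility, gamma=3.0):
    """Invert Eq. (14): the sure wealth level that delivers the average
    realized utility (the CE yield is this level expressed as a return)."""
    return (1.0 + (1.0 - gamma) * avg_utility) ** (1.0 / (1.0 - gamma))

u = crra_utility(1.05)            # utility of 5% wealth growth
W_ce = certainty_equivalent(u)    # recovers 1.05 by construction
```

Because the utility function is concave, a strategy with steady monthly growth yields a higher average utility (and CE yield) than one with the same terminal wealth delivered in a single jump, which is exactly the comparison the paragraph above describes.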


Fig. 5. SHAP values for return factors and volatility factors.


This figure depicts the Top 20 mean contribution of each predictor to the prediction of stock returns and volatility, and from left to right, the encoder layers of the TF
model are represented from 1 to 5. The features are sorted on the y-axis based on the sum of their absolute SHAP values across all samples, while the x-axis represents
the SHAP value. Each dot represents a specific sample, with color indicating the value of the feature (red for high values, blue for low values). Dots located to the left
of the central axis (where the SHAP value is zero) suggest that the corresponding feature has a negative impact on stock returns (SHAP value is negative), while dots
located to the right suggest a positive impact. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of
this article.)
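Eq. (13) can be made concrete with a brute-force sketch on a toy model. In practice, the SHAP values in Fig. 5 are computed by an approximation package rather than by enumerating subsets; the additive toy model below is purely illustrative.

```python
from itertools import combinations
from math import factorial

def shap_values(f, n_features):
    """Exact Shapley attribution of Eq. (13): average each feature's marginal
    contribution f(S + {i}) - f(S) over all subsets S, with Shapley weights.
    `f` maps a tuple of included feature indices to a prediction."""
    F = list(range(n_features))
    mu = [0.0] * n_features
    for i in F:
        rest = [j for j in F if j != i]
        for k in range(n_features):
            for S in combinations(rest, k):
                w = factorial(len(S)) * factorial(n_features - len(S) - 1) \
                    / factorial(n_features)
                mu[i] += w * (f(S + (i,)) - f(S))
    return mu

# Toy additive "model": the prediction is the sum of included feature values.
x = [0.4, -0.1, 0.3]
f = lambda S: sum(x[j] for j in S)
mu = shap_values(f, len(x))   # for an additive model, mu_i equals x_i
```

Exact enumeration is exponential in the number of features, which is why the paper's 80-feature setting requires an approximate SHAP estimator.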

Table 3 contains the average annual realized utilities, the certainty-equivalent (CE) yields computed as the inverse utility function of the average realized utility, and the terminal wealth for different leverage coefficients.

The linear regression model exhibits the worst utility, at 0.229 and 0.209 under the 100% and 150% leverage limits, respectively. The TF model strategy (E = 4) exhibits the best utility and the highest CE yield, 0.314 and 5.19%, when the leverage is 100%, with the RF model strategy second, with a utility of 0.310 and a CE yield of 5.19%. At the 150% leverage limit, the TF model (E = 4) also achieves the best utility and the highest CE yield, 0.327 and 5.21%, with the TF model (E = 3) and the TF model (E = 5) ranking second and third, respectively. This suggests that the model that gives investors the highest utility may differ across leverage limits. The RF model and the TF model perform similarly over a long period of time with no leverage restrictions, while the opposite is true with the 150% leverage restriction. However, the TF model ultimately shows a higher final return.

3.6. Transaction cost

Transaction costs are always considered in real investment. In this section, we investigate the asset allocation results after deducting proportional transaction costs of 20, 40, 60 and 80 basis points. Table 4 reports the results with different leverage ratios and transaction costs. We find that the absolute transaction costs of a machine learning model with a leverage of 1 are smaller than those of a machine learning model with a leverage of 1.5: on the one hand, the initial principal of the machine learning model with a leverage ratio of 1 is relatively small; on the other hand, the double leverage ratio limits the ordinary position adjustment of funds. In terms of relative transaction costs, the relative transaction cost of a machine learning model with a 150% leverage limit is smaller than that of a machine learning model with a 100% leverage limit, which shows that when we capture a high-yield opportunity, we can choose to bear the cost of position adjustment to obtain higher returns. In addition, the linear regression model has the largest change in relative transaction cost with basis points, indicating that the other


Fig. 6. Cumulative returns of reward-risk timing to linear regression (100% leverage).


This figure plots the cumulative returns for the portfolio of each model from 2010 to 2019 with 100% leverage. We include linear regression (LR) as the benchmark.
The initial investment is $1.
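The conditional weight of Eq. (10) driving these strategies can be sketched as follows. The clipping to [0, max leverage] reflects the paper's no-short-selling and leverage constraints; the function and argument names are our own.

```python
def risky_weight(exp_ret, rf_prev, exp_var, gamma=3.0, max_leverage=1.0):
    """Conditional risky-asset weight of Eq. (10): expected excess return
    over (risk aversion x expected variance), clipped to the paper's
    no-short-selling and leverage constraints."""
    w = (exp_ret - rf_prev) / (gamma * exp_var)
    return min(max(w, 0.0), max_leverage)   # no shorting; leverage cap

# 1.2% expected return, 0.2% risk-free yield, 5% expected monthly volatility.
w = risky_weight(exp_ret=0.012, rf_prev=0.002, exp_var=0.0025, gamma=3.0)
# unconstrained weight exceeds 1, so it is clipped to the 100% leverage cap
```

With the 150% leverage limit used in Fig. 7, the same forecasts would allow a weight of up to 1.5, which is why the levered strategies compound faster in expansions and fall harder in declines.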

Fig. 7. Cumulative returns of reward-risk timing to linear regression (150% leverage limitation).
This figure plots the cumulative returns for the portfolio of each model from 2010 to 2019 with 150% leverage. We include linear regression (LR) as the benchmark.
The initial investment is $1.

models have a smaller adjustment than the linear model. Among the TF models, those with three encoders have the smallest change in transaction cost with basis points, and their investments are more stable.
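The cost deduction behind Table 4 can be sketched as follows. This is an assumption-laden illustration: the paper does not spell out its turnover definition, so here a proportional cost is charged on each month's change in the risky weight.

```python
def net_wealth(returns, weights, rf, cost_bps):
    """Terminal wealth of a timing strategy after proportional transaction
    costs (a sketch: costs are charged on the monthly change in the risky
    weight, starting from an all-cash position)."""
    c = cost_bps / 10_000.0
    wealth, prev_w = 1.0, 0.0
    for r, w, f in zip(returns, weights, rf):
        wealth *= 1.0 + w * r + (1.0 - w) * f   # gross monthly growth
        wealth *= 1.0 - c * abs(w - prev_w)     # cost on rebalancing turnover
        prev_w = w
    return wealth

# Two months, fully invested, zero risk-free rate, 40 bps per unit turnover:
# only the initial entry into the market incurs a cost.
wl = net_wealth([0.10, -0.05], [1.0, 1.0], [0.0, 0.0], 40)
```

Under this convention, a strategy that trades its weight less aggressively, as the three-encoder TF does, loses less terminal wealth as the cost per basis point rises.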


Table 2
Sharpe ratios and max downgrades.

Strategy    Annual Return (%)   Standard Deviation (%)   Sharpe Ratio   Max Downgrade (%)
Leverage = 1
LR          9.46     4.56    1.36    19.3
EN          11.3     3.80    2.13    20.2
PCA         9.68     3.70    1.73    21.2
LSTM        10.15    3.62    1.90    22.8
RF          15.9     5.09    2.49    35.1
TF(E = 1)   15.2     5.06    2.37    47.0
TF(E = 2)   16.7     5.06    2.65    52.3
TF(E = 3)   16.4     5.12    2.57    35.3
TF(E = 4)   17.2     5.07    2.75    54.0
TF(E = 5)   16.3     5.46    2.39    51.3
Leverage = 1.5
LR          8.97     5.09    1.12    26.2
EN          11.6     3.83    2.18    32.4
PCA         9.68     3.70    1.73    29.6
LSTM        10.1     3.62    1.91    29.6
RF          20.0     6.40    2.62    67.3
TF(E = 1)   19.9     6.36    2.62    64.9
TF(E = 2)   22.7     6.60    2.96    71.2
TF(E = 3)   22.1     6.40    2.95    70.3
TF(E = 4)   22.9     6.43    3.06    71.9
TF(E = 5)   21.8     6.90    2.70    69.6

This table shows the out-of-sample annual returns, standard deviations, Sharpe ratios and max downgrades for the test period from 2010 to 2019 for the trading rules.

Table 4
Transaction costs of portfolio allocation.

                     Terminal Wealth                  Sharpe ratio
Transaction cost     20 bps  40 bps  60 bps  80 bps   20 bps  40 bps  60 bps  80 bps
Leverage = 1
LR                   2.34    2.21    2.10    1.99     1.23    1.10    0.97    0.85
EN                   2.88    2.83    2.79    2.74     2.09    2.04    1.99    1.94
PCA                  2.47    2.42    2.37    2.32     1.68    1.61    1.55    1.49
LSTM                 2.62    2.60    2.60    2.58     1.90    1.88    1.88    1.86
RF                   4.26    4.13    4.01    3.89     2.42    2.35    2.29    2.22
TF(E = 1)            3.99    3.84    3.69    3.54     2.29    2.20    2.11    2.02
TF(E = 2)            4.52    4.36    4.20    4.04     2.58    2.49    2.40    2.31
TF(E = 3)            4.45    4.33    4.22    4.11     2.51    2.45    2.39    2.33
TF(E = 4)            4.71    4.52    4.33    4.14     2.67    2.56    2.45    2.34
TF(E = 5)            4.39    4.25    4.11    3.97     2.33    2.25    2.17    2.09
Leverage = 1.5
LR                   2.22    2.09    1.97    1.85     1.00    0.87    0.74    0.61
EN                   2.96    2.91    2.86    2.81     2.14    2.09    2.04    1.99
PCA                  2.47    2.42    2.37    2.32     1.68    1.61    1.55    1.49
LSTM                 2.60    2.58    2.56    2.54     1.88    1.86    1.84    1.82
RF                   6.01    5.79    5.59    5.39     2.56    2.49    2.43    2.36
TF(E = 1)            5.94    5.71    5.48    5.25     2.55    2.48    2.41    2.34
TF(E = 2)            7.51    7.23    6.95    6.67     2.89    2.82    2.75    2.68
TF(E = 3)            7.21    7.00    6.80    6.61     2.90    2.85    2.79    2.74
TF(E = 4)            7.60    7.29    6.98    6.67     3.00    2.93    2.86    2.79
TF(E = 5)            6.96    6.69    6.42    6.15     2.63    2.56    2.49    2.42

This table reports the impact of transaction costs on the monthly return (in %) and the annualized Sharpe ratio of the portfolio strategies based on different machine learning algorithms.

level samples to seek the economic ground and check the model's robustness.

Table 3 4.1. Analysis based on size and value portfolio


Average realized utilities.
We obtain six portfolios based on the market capitalization and book-
Strategy Utility CE yield (%) Terminal Wealth
to-market ratio of individual stocks following Fama and French (1993).
Leverage = 1 For these portfolios, we use the TF model to train and forecast the
LR 0.229 5.10 2.46
number of encoder layers from one to five to obtain the predicted returns
EN 0.278 5.16 2.92
PCA 0.253 5.13 2.51 of each portfolio from 2010 to 2019. Using the predicted results for asset
LSTM 0.268 5.15 2.62 allocation, we obtain the true monthly return of the portfolio and further
RF 0.310 5.19 4.38 construct the regression equations:
TF(E = 1) 0.276 5.15 4.14 ( )
TF(E = 2) 0.302 5.18 4.68 Ri = α + β1 RM − Rf + β2 SMB + β3 HML (FF3) (15)
TF(E = 3) 0.303 5.18 4.57
TF(E = 4) 0.314 5.19 4.90 ( )
TF(E = 5) 0.297 5.18 4.53
Ri = α + β1 RM − Rf + β2 SMB + β3 HML + β4 RMW + β5 CMA.(FF5)
Leverage = 1.5 (16)
LR 0.209 5.08 2.36
EN 0.280 5.16 3.00 where Ri refers to the expected returns of portfolio i with each model;
PCA 0.253 5.13 2.52 ( )
RM − Rf refers to the market risk premium; SMB is the size factor; HML
LSTM 0.268 5.15 2.62
RF 0.320 5.20 6.22 is the value factor; RMW is the profitability factor; and CMA is the in­
TF(E = 1) 0.290 5.17 6.17 vestment factor.
TF(E = 2) 0.317 5.20 7.79 Table 5 tabulates that all alphas of Eqs. (15) and (16) in the regres­
TF(E = 3) 0.321 5.20 7.42
sion are significant, suggesting that the returns obtained under the TF
TF(E = 4) 0.327 5.21 7.91
TF(E = 5) 0.323 5.21 7.23 model cannot be fully explained by FF3 or FF5. In addition, as the
market size increases and book-to-market ratio decreases, the alpha
In this table, the average annual realized utilities, annual CE yields, and terminal
becomes more pronounced. The result is consistent with the finding of
wealth for each strategy are shown under lever coefficients 1 and 1.5 for the
Leippold et al. (2022) that using fundamental and macro signals with a
2010 to 2019 out-of-sample period.
self-attention mechanism, the TF model shows better performance on
large and value firms than other subsamples and improves market
4. Economic grounds of the transformer model and robustness
efficiency.
analysis

In this section, we extend our model to several common portfolio-
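A minimal sketch of the time-series regressions in Eqs. (15) and (16), estimating the intercept (alpha) and its t-statistic by OLS. The factor data below are simulated placeholders, not the actual FF3/FF5 series:

```python
import numpy as np

def alpha_tstat(excess_returns, factors):
    """Regress portfolio excess returns on factor returns (Eqs. 15-16)
    and return the intercept (alpha) together with its OLS t-statistic."""
    y = np.asarray(excess_returns, dtype=float)
    F = np.asarray(factors, dtype=float)
    X = np.column_stack([np.ones(len(y)), F])        # [1, RM-Rf, SMB, HML, ...]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])       # residual variance
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[0, 0])  # std. error of the intercept
    return beta[0], beta[0] / se

# Simulated example: 120 months, three factors, true monthly alpha = 1%.
rng = np.random.default_rng(1)
ff3 = rng.normal(0.005, 0.03, (120, 3))
ret = 0.01 + ff3 @ np.array([1.0, 0.3, 0.2]) + rng.normal(0.0, 0.01, 120)
alpha, t_alpha = alpha_tstat(ret, ff3)
```

A significantly positive alpha, as reported in Table 5, means the strategy's returns are not fully absorbed by the factor loadings.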


Table 5
Excess returns and t values of the model in FF3 and FF5.

                 FF3-α(%)   t(FF3-α)   Relative benefits   FF5-α(%)   t(FF5-α)   Relative benefits
SIZE
S   TF(E = 1)   0.62       1.67*      2.05                0.66       1.70*      3.06
    TF(E = 2)   0.63       1.69*      2.18                0.67       1.73*      3.54
    TF(E = 3)   0.67       1.78*      2.29                0.71       1.81*      3.82
    TF(E = 4)   0.66       1.72*      2.28                0.69       1.74*      3.67
    TF(E = 5)   0.63       1.67*      2.21                0.66       1.68*      3.53
M   TF(E = 1)   1.00       2.76***    1.51                1.03       2.73***    2.04
    TF(E = 2)   0.98       2.68***    1.61                1.01       2.65***    2.92
    TF(E = 3)   1.03       2.80***    1.62                1.05       2.75***    2.95
    TF(E = 4)   1.00       2.69***    1.60                1.02       2.64***    2.47
    TF(E = 5)   0.97       2.65***    1.47                1.00       2.62***    1.92
B   TF(E = 1)   1.20       4.54***    0.90                1.22       4.45***    1.21
    TF(E = 2)   1.16       4.62***    0.88                1.20       4.61***    1.26
    TF(E = 3)   1.30       4.65***    0.85                1.33       4.56***    1.28
    TF(E = 4)   1.16       4.58***    0.89                1.19       4.50***    1.34
    TF(E = 5)   1.25       4.55***    0.78                1.28       4.47***    0.88
BM ratio
H   TF(E = 1)   0.37       1.77*      1.28                0.40       1.84*      1.40
    TF(E = 2)   0.36       1.79*      1.35                0.40       1.94*      1.51
    TF(E = 3)   0.43       1.96**     1.52                0.46       2.03**     2.08
    TF(E = 4)   0.34       1.73*      1.40                0.38       1.85*      1.61
    TF(E = 5)   0.38       1.75*      1.42                0.41       1.82*      1.72
M   TF(E = 1)   0.94       3.39***    1.10                0.99       3.45***    1.45
    TF(E = 2)   0.90       3.33***    1.16                0.94       3.35***    1.66
    TF(E = 3)   0.97       3.48***    1.23                1.02       3.53***    1.75
    TF(E = 4)   0.94       3.34***    1.28                0.98       3.37***    2.11
    TF(E = 5)   0.94       3.38***    1.20                0.99       3.43***    1.68
L   TF(E = 1)   1.97       5.84***    0.71                1.99       5.66***    1.20
    TF(E = 2)   1.90       5.70***    0.67                1.92       5.51***    1.14
    TF(E = 3)   1.99       5.86***    0.78                2.01       5.69***    1.57
    TF(E = 4)   2.02       5.80***    0.79                2.04       5.63***    1.61
    TF(E = 5)   1.90       5.67***    0.63                1.91       5.47***    0.96

This table shows the excess returns and their t values for different portfolios in the FF3 (Fama & French, 1993) and FF5 (Fama & French, 2015) models at encoder levels of the TF model from 1 to 5. The portfolios are divided 3:4:3 according to market size and the book-to-market ratio. S1 denotes the portfolio of small stocks with 1 encoder layer. Relative returns are a concrete embodiment of the content of Fig. 9. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.

4.2. The performance in characteristic-managed portfolios

This section tests whether the zero-intercept no-arbitrage restriction is satisfied in the prediction process (Gu et al., 2020). We focus this analysis on unconditional pricing errors, defined as:

\alpha_i := E(u_{i,t}) = E(r_{i,t}) - E(\hat{r}_{i,t}),   (17)

where \alpha_i is the forecast error, that is, the expectation of the monthly forecast error of the characteristic-managed portfolios with 72 characteristics in our data sample, E(r_{i,t}) is the real return in each month t, and E(\hat{r}_{i,t}) is the expectation of the predicted return. Fig. 8 scatters the estimated out-of-sample prediction errors for each model against the average returns of the managed portfolio x_t. The figure also reports the number of alphas whose t-statistics exceed 3.0. The overall magnitude of the alphas shrinks as we move from the LSTM model to the TF model.

The results show that the scatter points of the LSTM model lie almost entirely on the y = x line, which shows that the LSTM model has almost no predictive effect. The scatter points of the TF model all lie under this line, and most of them are close to the y = 0 line, indicating that the prediction results of the model are good. In addition, whereas the LSTM model passes the t-test only 38 times and 34 factor features are significant, the TF models pass the t-test as many as 68 times and only 4 factor features are significant, a substantial improvement, so we can almost conclude that the prediction error of the TF model is 0.

4.3. Intraindustry vs. interindustry predictability

Firms within the same industry are highly correlated, and they always bear similar returns stemming from common sources (Seavey, Imhof, & Watanabe, 2016), such as technological shocks, the regulatory environment, and industry-specific demand and supply for products and services. Our prior findings suggest that deep learning signals can predict future returns at the stock and portfolio levels, and matching similar firms within the same industry would further provide a natural framework to control for firm fundamentals and understand the sources of return predictability.

To pursue the analysis, we follow Avramov, Cheng, and Metzker (2022) and implement an unconditional trading strategy based on the machine learning predicted return of stock i in month t, denoted by \hat{r}_{i,t}. In every month t, we take a long position on stocks that are expected to confront higher returns than the market average, i.e., \hat{r}_{i,t} - \hat{r}_{m,t} > 0, defined as the winner portfolio (W), and take a short position on stocks that are expected to have lower returns than the market average, i.e., \hat{r}_{i,t} - \hat{r}_{m,t} < 0, defined as the loser portfolio (L). \hat{r}_{m,t} refers to the equal-weighted average of \hat{r}_{i,t} across all stocks in the market, that is, \hat{r}_{m,t} = \frac{1}{N_t} \sum_{i=1}^{N_t} \hat{r}_{i,t}, where N_t refers to the number of stocks in the market. We hold the portfolios over the next month. Then, we analyze the realized return spread between the winner portfolio and the loser portfolio, denoted as WML:

WML_{t+1} = \frac{1}{H_t} \sum_{i=1}^{N_t} (\hat{r}_{i,t} - \hat{r}_{m,t}) r_{i,t+1},   (18)

H_t = \frac{1}{2} \sum_{i=1}^{N_t} |\hat{r}_{i,t} - \hat{r}_{m,t}|.   (19)

The weight of each stock is proportional to the stock's model-predicted return on a market-adjusted basis, with higher weights for higher-return stocks in the long leg and more negative weights for lower-return stocks in the short leg. The result is scaled by the inverse of the sum of absolute deviations of stock returns from the market average for standardization.

Next, following Hameed and Mian (2015) and Avramov et al. (2022), we decompose the return spread into two components. WML can be rewritten as:


Fig. 8. Out-of-sample prediction across models.
This figure reports out-of-sample prediction errors (alphas) for 72 characteristic-managed portfolios x_t relative to the LSTM model and the TF model (encoder layers from 1 to 5). Alphas with t-statistics in excess of 3.0 are shown as red dots, while insignificant alphas are shown as hollow squares. The red dots not shown in the figure are too close to the scatter point (9.46, 7.70). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
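The diagnostic behind Fig. 8 and Eq. (17), namely the average forecast error of each characteristic-managed portfolio and the count of alphas with t-statistics above 3.0, can be sketched as follows. The 120 x 72 return panel here is simulated for illustration only:

```python
import numpy as np

def pricing_errors(realized, predicted):
    """Unconditional pricing errors (Eq. 17): alpha_i = E[r_it] - E[r_hat_it],
    with a simple t-statistic for each portfolio's mean forecast error."""
    u = np.asarray(realized, dtype=float) - np.asarray(predicted, dtype=float)
    T = u.shape[0]
    alpha = u.mean(axis=0)
    tstat = alpha / (u.std(axis=0, ddof=1) / np.sqrt(T))
    return alpha, tstat

# Simulated panel: 120 months x 72 characteristic-managed portfolios,
# with forecasts that are unbiased up to noise.
rng = np.random.default_rng(2)
realized = rng.normal(0.01, 0.05, (120, 72))
predicted = realized + rng.normal(0.0, 0.02, (120, 72))
alphas, tstats = pricing_errors(realized, predicted)
n_significant = int((np.abs(tstats) > 3.0).sum())  # alphas flagged in Fig. 8
```

For a model with near-zero pricing errors, the alphas cluster around zero and few t-statistics exceed 3.0, which is the pattern the paper reports for the TF model.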

WML_{t+1} = \frac{1}{H_t} \sum_{i=1}^{N_t} (\hat{r}_{i,t} - \hat{r}_{ind,t} + \hat{r}_{ind,t} - \hat{r}_{m,t}) r_{i,t+1}
          = \frac{1}{H_t} \sum_{i=1}^{N_t} (\hat{r}_{i,t} - \hat{r}_{ind,t}) r_{i,t+1} + \frac{1}{H_t} \sum_{i=1}^{N_t} (\hat{r}_{ind-j,t} - \hat{r}_{m,t}) r_{i,t+1}
          = \frac{1}{H_t} \sum_{i=1}^{N_t} (\hat{r}_{i,t} - \hat{r}_{ind,t}) r_{i,t+1} + \frac{1}{H_t} \sum_{j=1}^{L_t} (\hat{r}_{ind-j,t} - \hat{r}_{m,t}) N_{j,t} r_{j,t+1},   (20)

where \hat{r}_{ind-j,t} refers to the equal-weighted average of \hat{r}_{i,t} across all stocks in industry j, that is, \hat{r}_{j,t} = \frac{1}{N_{j,t}} \sum_{i=1}^{N_{j,t}} \hat{r}_{i,t}, where N_{j,t} refers to the number of stocks in industry j. L_t refers to the number of industries, and r_{j,t+1} refers to the equal-weighted average of stock returns across all stocks in industry j in month t+1.

The first term in Eq. (20) represents the intraindustry return spread, which takes a long position on stocks that are expected to confront higher returns than the industry average, i.e., \hat{r}_{i,t} - \hat{r}_{j,t} > 0, and a short position on stocks that are expected to retain lower returns than the industry average, i.e., \hat{r}_{i,t} - \hat{r}_{j,t} < 0. The second term represents the interindustry return spread, which takes a long position on industries that are expected to confront higher returns than the market average, i.e., \hat{r}_{j,t} - \hat{r}_{m,t} > 0, and a short position on industries that are expected to retain lower returns than the market average, i.e., \hat{r}_{j,t} - \hat{r}_{m,t} < 0.

Following the same method, we also construct the risky minus safety (RMS) portfolios considering the volatility forecasting process; that is, stocks with higher predicted risk than the market average, i.e., \hat{vol}_{i,t} - \hat{vol}_{m,t} > 0, are defined as the risk portfolio (R), and stocks that are expected to have lower crash risk than the market average, i.e., \hat{vol}_{i,t} - \hat{vol}_{m,t} < 0, are defined as the safety portfolio (S).

The results are tabulated in Table 6, and several findings are worth noting. First, all machine learning methods perform better in intraindustry samples than in interindustry samples. For instance, the return spread of the intraindustry WML portfolio accounts for 90% of the total return spread in WML using the TF3 method (0.038 out of 0.042), which


Table 6
Transformer model attribution in intra- and interindustry.

                      LSTM       TF1        TF2        TF3        TF4        TF5
Panel A: Return
Winner                0.078      0.065**    0.056**    0.070***   0.052***   0.063***
                      (1.37)     (2.45)     (2.46)     (2.94)     (2.55)     (2.60)
Loser                 0.030      0.030      0.024      0.028      0.024      0.027
                      (1.33)     (1.33)     (1.31)     (1.52)     (1.30)     (1.50)
WML                   0.029      0.037***   0.033***   0.042***   0.028***   0.036***
                      (1.00)     (3.06)     (3.83)     (3.88)     (3.64)     (2.92)
WML (intraindustry)   0.028      0.032***   0.030***   0.038***   0.026***   0.032***
                      (0.96)     (3.11)     (3.63)     (3.90)     (3.67)     (2.93)
WML (interindustry)   0.001      0.004      0.004      0.004      0.002      0.005
                      (0.02)     (0.39)     (0.43)     (0.38)     (0.30)     (0.42)
Panel B: Volatility
Risk                  0.169***   0.167***   0.168***   0.169***   0.173***   0.169***
                      (20.82)    (20.53)    (19.82)    (20.42)    (20.86)    (20.82)
Safety                0.104***   0.104***   0.105***   0.107***   0.104***   0.104***
                      (23.79)    (23.79)    (27.24)    (26.69)    (26.75)    (28.27)
RMS                   0.065***   0.063***   0.063***   0.062***   0.069***   0.065***
                      (6.49)     (6.28)     (6.06)     (6.03)     (6.83)     (6.50)
RMS (intraindustry)   0.055***   0.053***   0.053**    0.052***   0.058***   0.055***
                      (6.18)     (6.21)     (6.01)     (5.93)     (6.61)     (6.18)
RMS (interindustry)   0.011      0.010      0.010      0.010      0.011      0.011
                      (1.23)     (1.18)     (1.10)     (1.11)     (1.29)     (1.23)

This table reports volatility and return in risk (loser) portfolios and safety (winner) portfolios, as well as return and volatility spreads in WML (RMS) portfolios and their decompositions in terms of intra- and interindustry. Panels A and B show the results in terms of the return and volatility measures, respectively. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.

illustrates that our method pays more attention to selecting stocks than to industry rotation. Second, the TF model performs better than LSTM in WML portfolios. While the value of RMS is significant at the 1% level for all methods, the value of WML is significant only for the TF method.

5. Conclusion

This paper extends the application of the Transformer (TF) model by Vaswani et al. (2017) to the Chinese stock market. We compare the performance of the TF model and other machine learning models in the stock market, delve deeper into the factors that influence stock forecasts, and compare the performance of different subsamples.

Our findings suggest that a significant portion of Chinese stock returns can be explained by the TF pricing model. In terms of return and volatility prediction, TF models show an extremely high R2 compared with traditional machine learning models. The TF model performs significantly better in asset allocation than other traditional machine learning models over the out-of-sample period and across subsamples. The TF model also performs strongly on various indicators, such as the Sharpe ratio and investor utility. After considering transaction costs, the TF model can still obtain sufficient benefits, indicating that it can be effectively applied in practice. In addition, in the comparison of subsamples, the TF model shows a predicted return closer to the real return, which demonstrates its strong predictive ability.

We also provide economic explanations for deep learning with the SHAP value. In return prediction, fundamental characteristics play a more important role, while macroeconomic features show more importance in volatility forecasting. The TF model is more inclined to predict return and volatility within industries than between industries.

Overall, our paper sheds light on the surging literature on asset allocation in the big data era and has great implications for the efficiency of financial markets in emerging markets. Meanwhile, our exploration of the TF model is still at the beginning stage, and the decoder mechanism can be further improved and broadly adopted in financial studies. For example, Woo, Liu, Sahoo, Kumar, and Hoi (2022) show that the TF variant ETSformer has a good effect on the smoothing analysis of time series.

Acknowledgements

The authors are grateful for the very constructive comments and suggestions from the editor, anonymous reviewers, Chunmin Zhang, Benjian Wu, Xuejun Zhang, Fuwei Jiang, Zhanyu Ying, and the seminar participants at Minzu University of China.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.irfa.2023.102876.

References

Abbas, G., McMillan, D. G., & Wang, S. (2018). Conditional volatility nexus between stock markets and macroeconomic variables: Empirical evidence of G-7 countries. Journal of Economic Studies, 45(1), 77–99.
Avramov, D., Cheng, S., & Metzker, L. (2022). Machine learning vs. economic restrictions: Evidence from stock return predictability. Management Science, 0(0).
Bali, T. G., Goyal, A., Huang, D., Jiang, F., & Wen, Q. (2022). Predicting corporate bond returns: Merton meets machine learning. Georgetown McDonough School of Business Research Paper (3686164), 20–110.
Chen, L., Pelger, M., & Zhu, J. (2021). Deep learning in asset pricing. Available from: https://doi.org/10.2139/ssrn.3350138.
Engle, R. F., Ghysels, E., & Sohn, B. (2013). Stock market volatility and macroeconomic fundamentals. Review of Economics and Statistics, 95(3), 776–797.
Fabian, H., & Marcel, P. (2022). Managing the market portfolio. Management Science, 0(0).
Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33, 3–56.
Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116, 1–22.
Fan, J., Ke, Z. T., Liao, Y., & Neuhierl, A. (2022). Structural deep learning in conditional asset pricing. Available from: SSRN 4117882.
Ferson, W. E., Siegel, A. F., & Wang, J. L. (2019). Asymptotic variances for tests of portfolio efficiency and factor model comparisons with conditioning information. University of Washington Working Paper.
Fisher, A., Martineau, C., & Sheng, J. (2022). Macroeconomic attention and announcement risk premia. Review of Financial Studies, 35(11), 5057–5093.
Giglio, S., Kelly, B., & Xiu, D. (2022). Factor models, machine learning, and asset pricing. Annual Review of Financial Economics, 14, 337–368.
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. The Review of Financial Studies, 33(5), 2223–2273.
Hameed, A., & Mian, G. M. (2015). Industries and stock return reversals. Journal of Financial and Quantitative Analysis, 50, 89–117.
Hanauer, M. X., Kononova, M., & Rapp, M. S. (2022). Boosting agnostic fundamental analysis: Using machine learning to identify mispricing in European stock markets. Finance Research Letters, 48, Article 102856.


Jiang, F., Tang, G., & Zhou, G. (2018). Firm characteristics and Chinese stocks. Journal of Management Science and Engineering, 3(4), 259–283.
Kandel, S., & Stambaugh, R. F. (1996). On the predictability of stock returns: An asset-allocation perspective. The Journal of Finance, 51, 385–424.
Kapetanios, G., & Kempf, F. (2022). Interpretable machine learning modeling for asset pricing. Working paper.
Kazemi, S. M., Goel, R., Eghbali, S., Ramanan, J., Sahota, J., Thakur, S., et al. (2019). Time2Vec: Learning a vector representation of time. Available from: 10.48550/arXiv.1907.05321.
Leippold, M., Wang, Q., & Zhou, W. (2022). Machine learning in the Chinese stock market. Journal of Financial Economics, 145(2), 64–82.
Lin, T., Wang, Y., Liu, X., et al. (2021). A survey of transformers. Available from: https://arxiv.org/abs/2106.04554v2.
Liu, J., Stambaugh, R. F., & Yuan, Y. (2019). Size and value in China. Journal of Financial Economics, 134, 48–69.
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Neural Information Processing Systems, 30.
Ma, T., Leong, W. J., & Jiang, F. (2023). A latent factor model for the Chinese stock market. International Review of Financial Analysis, 87, 102555.
Ma, T., Liao, C., & Jiang, F. (2023). Timing the factor zoo via deep learning: Evidence from China. Accounting & Finance, 63, 485–505.
Marquering, W., & Verbeek, M. (2004). The economic value of predicting stock index returns and volatility. Journal of Financial and Quantitative Analysis, 39(2), 407–429.
Moreira, A., & Muir, T. (2017). Volatility-managed portfolios. The Journal of Finance, 72(4), 1611–1644.
Pinelis, M., & Ruppert, D. (2022). Machine learning portfolio allocation. The Journal of Finance and Data Science, 8, 35–54.
Seavey, S. E., Imhof, M., & Watanabe, O. V. (2016). Proprietary costs of competition and financial statement comparability. University of Nebraska-Lincoln Working Paper.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Available from: 10.48550/arXiv.1706.03762.
Woo, G., Liu, C., Sahoo, D., Kumar, A., & Hoi, S. (2022). ETSformer: Exponential smoothing transformers for time series forecasting. Available from: 10.48550/arXiv.2202.01381.
