Professional Documents
Culture Documents
‡
Jiang∗
*
Likelihood Technology
†
University of California, Irvine
yaoqil1@uci.edu deniselll877@gmail.com
‡
Sun Yat-sen University
{chenxy628, yangjm25, caojh7, jiangkk3}@mail2.sysu.edu.cn
2019.08.25
Contents
1 INTRODUCTION 3
1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 METHODOLOGY 4
2.1 Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Basic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Advantages of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.3 Structure of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.4 LSTM Model Construction . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Basic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 CNN Model Construction . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 CNN Framework Proposed . . . . . . . . . . . . . . . . . . . . . . . . 7
3 EXPERIMENT 8
3.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Sources of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.2 Time Period of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1
3.1.3 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.4 Label Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.5 Division of Training Set and Test Set . . . . . . . . . . . . . . . . . . 12
3.2 Experiment of the LSTM model . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Hyperparameter Selection for Loss Function and Optimizer . . . . . . 12
3.2.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.3 Keep Probability of the Dropout Layer . . . . . . . . . . . . . . . . . 15
3.2.4 Training the LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Experiment of the CNN Model . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 The Difference and Improvements from Framework by E. Hoseinzade
and S. Haratizadeh . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2 Comparison with DNN . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Regularization-SpatialDropout . . . . . . . . . . . . . . . . . . . . . . 19
3.3.4 Input Data of Five Days VS. One Day . . . . . . . . . . . . . . . . . 20
3.3.5 Selection of Final Framework . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Make Stock Purchase Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 CONCLUSION 22
4.1 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2
1 INTRODUCTION
Training a practical and effective model for stock selection has been a greatly concerned
problem in the field of artificial intelligence. Because of the uncertainty and sensitivity of
the finance market, there are many factors which may influence the stock price such as
significant events, society’s economic condition or political turmoils. Many scholars have
been applied various methods of machine learning to find a fitting model for the stock price
time-series data with nonlinearities, discontinuities, and high-frequency multi-polynomial
components. [4] To handle these complicated components and make a precise prediction, a
lot of scholars choose to use machine learning to create a model.
1.2 Background
Even though some of the models from previous works have been achieved good per-
formance in the U.S. market by using low-frequency data and features, training a suitable
model with high-frequency stock data is still a problem worth exploring. There is another
challenge for stock trading novices without experience in this field. Although there are ex-
isting low-frequency features created by some experts, constructing useful high-frequency
features with high-level information is difficult for us. Moreover, many existing features
which are calculated by the U.S. stock market index different from the China stock market.
Therefore, we prefer to apply methods without constructing features by ourselves. In this
paper, we introduce two machine learning algorithms LSTM (long short-term memory) and
CNN (convolutional neural network) to find the most beneficial strategy of stock trading in
China stock market.
1.3 Summary
Through the high-frequency price data of the past period (one day or several days),
we construct two models which can predict the expected return rate of the stock on the
day, and select the stock with the highest expected yield at the opening to maximize the
total return. The organization of the rest paper is as follows: Section 2 presents general
concepts, advantages and detailed constructions for the two models and our preparation of
processing data. Section 3 describes hyperparameter selection, regularization and how we
train our models. Section 4 compares the accuracy, thermodynamic diagram and the result
3
of backtesting to analyze their advantages and weakness. At last, we summarize our work
and indicate several future attempts.
2 METHODOLOGY
2.1 Long Short-Term Memory
One of a special kind of RNN(Recurrent Neural Network) is called Long Short-Term
Memor(LSTM) which was first proposed by Hochreiter and Schmidhuber in 1997. [5] Gers et
al. advanced its structure by introducing forget gate in 2000. [3] after a few years, H.Sak et
al. and W.Zaremba et al. improved the framework of LSTM and added more details. [11,13]
Nowadays, LSTM is widely used in the field of natural language processing and emotion
analysis.
4
RNN because not only can it converge quickly, also it can solve the problem of vanishing
gradient and exploding gradient. Moreover, considering the feature that LSTM can maintain
long-term memory, it is sensitive to the long-term dependency of data.
sinh x ex − e−x
tanh(x) = = x
cosh x e + e−x
The weight matrix W and the bias b will be generated by Back propagation algorithm.
5
2.1.4 LSTM Model Construction
Here, we use xit to represent the feature vector of stock i at time t. We treat a time
series {xit } (t = 1, 2, . . . , T ) as an instance to be the input of our LSTM network which
can predict the profit y of the next day. To consider this problem reasonably, we make this
problem to be a classification problem. Hence, we create a classification model depended on
LSTM network in Figure 3. The LSTM layer of this model has been introduced carefully in
the previous section. In the dropout layer, each cell has a certain probability to stop working
in each training process to avoid over-fitting of the model. Therefore, when predicting, all
neurons participate in the work to maximize the role of the model. Earlier experiments
have confirmed that the addition of this layer has a good performance on regularization. [5]
The Dense layer has no activation function and outputs the score of the probability that the
predicted instance belongs to each category. The score is then converted to the corresponding
probability by the softmax layer. In this process, softmax layer use the equation as follows:
ei
Pi = P
j ej
In this equation, Pi and i represent the probability and score respectedly that the predicted
instance belongs to certain category. The category with the highest probability is regarded
as the prediction made by the model on the instance category.
6
last layer of the output layer is a classifier that can be classified using logistic regression or
SVM(support vector machine) or other methods.
Fully Connected Layers In these layers, the 16×40 data generated by the CNN layer are
flattened into a final feature vector. This feature vector is then converted to a final prediction
through 2 fully connected layers (2 hidden layers). And the output layer has 4 neurons, with
SoftMax function utilized as the activation function, intended to give the probability of a
big rise, a small rise, a big drop, a small drop of a stock respectively, and then calculate the
7
expectation of its daily return today to make stock purchase advice. The whole framework
is demonstrated in Figure 4
3 EXPERIMENT
3.1 Data Processing
3.1.1 Sources of Data
We have chosen the closing price, opening price, highest price, lowest price, trading
volume, transaction amount, number of transactions, commission ratio, volume ratio, com-
mission purchase, commission sale of the China A-share market. These 11 volume-price
features are used as elements to describe the state of the stock. In order to train with
market-represented stocks and reduce data inconsistency (such as stock suspension) and
noise, we selected the constituents of the CSI300 Index, denoted as I, as the source of the
sample data set. The model uses two types of data, every 15 minutes of data and every 120
minutes of data. The sample data is exhibited as in Table 2 and 3.
8
Figure 1: An unfolded Recurrent Neural Network
9
Figure 2: Basic Cell of Long Short Term Memory
10
Figure 5: Trends of the CSI300 Index from July, 2014 to December, 2018
As for sample set, we choose the data in Figure 5 from July 1,2014 to December 31,
2018, denoted as M , during which Chinese stock market witnessed periods of sharp rise,
sharp fall, slight rise and slight fall, providing sufficient samples for each of our four labels.
Since our ability of computation is insufficient, when we adjusted the model, we only used
data from January,2019 to May,2019, denoted as N .
x = (x1 , x2 , . . . , x11 )
xk − min{xk }
t∈T
if max{xk } − min{xk } =
6 0
max{xk } − min{xk }
t∈T t∈T
x0k = t∈T t∈T
0 if max{xk } − min{xk } = 0
t∈T t∈T
0
x = (x01 , x02 , . . . , x011 )
∀i ∈ I, t ∈ T, k = 1, 2, . . . , 11
11
Table 1: Labels selected
Labels Fall Sharply Fall Slightly Rise Slightly Rise Sharply
Yields Interval −10% ∼ −1.06% −1.06% ∼ +0.04% +0.04% ∼ +1.32% +1.32% ∼ +10%
09:456 9.72 9.74 9.79 9.72 2.38 × 106 2.32 × 107 1327 1.77 1.43 1.45 × 108 5.93 × 107
10:00 9.61 9.73 9.73 9.60 3.63 × 106 3.51 × 107 1716 2.30 1.81 1.16 × 108 4.85 × 107
. . . . . . . . . . . .
.. .. .. .. .. .. .. .. .. .. .. ..
15:00 9.70 9.70 9.71 9.68 2.21 × 106 2.14 × 107 728 0.21 0.89 8.61 × 107 1.54 × 108
1 5
number of transactions commission sale
2 6
commission ratio Chinese stock market opening time:
3
volume ratio 9:30∼11:30 and 13:00∼15:00
4
commission purchase
In this equation, m is the total number of samples; n is the total number of categories;
yij represents the real category (only 0 or 1), and ŷij is the probability of the corresponding
category output by the softmax layer.
For the optimizer, we selected Adam, Adadelta, and RMSProp three adaptive optimizers
for testing (learning rate is 0.001). The performance of different optimizers is represented in
Figure 6. The test uses batch gradient descent method: there are 30 samples per batch and
all samples do 50 iterations. Also, all the following tests are the same.
12
Figure 6: The Accuracy of Different Optimizer in Test Set and Train Set
13
Table 3: Data Example 2 (Stock: SH600000 Date: Jan 2th , 2019)
Time Close Open Highest Lowest Volume Amount NOT CR VR CP CS
11:30 9.67 9.74 9.79 9.58 1.37 × 107 1.32 × 108 7429 1.75 1.03 1.13 × 109 4.23 × 108
15:00 9.70 9.70 9.71 9.68 2.21 × 106 2.14 × 107 728 0.21 0.89 8.61 × 107 1.54 × 108
We use the early stopping to avoid overfitting at the 30th epoch. The Confusion matrix
thermograms of the three models which are normalized by index and are represented in
Figure 7.
As can be seen from the graph, the Adadelta optimizer is inferior, and Adam optimizer
and the RMSProp optimizer are equally effective. Therefore, according to this result, we
chose Adam optimizer in our model.
3.2.2 Regularization
Figure 8: The Accuracy of Different Regularization in Test Set and Train Set
Regarding the regularization method, we compared the performance of the same data
using the dropout layer and L2 regularization (weight decay). L2 regularization adds a
regular term to the loss function:
λ X 2
C = C0 + w
2n w
14
Figure 9: The Confusion Matrix Thermogram for different hyperparameter
Depend on the result showed in the graph, we can conclude that although L2 regulariza-
tion can effectively avoid overfitting, using the dropout layer can make the model converge to
a better solution fast. Hence, our model decides to use the dropout layer for regularization.
Figure 10: The Accuracy of Different Keep Parameter in Test Set and Train Set
The confusion matrix thermogram for the three models at the 30th epoch ,with using
the early stopping to avoid overfitting, normolized by index, are represented in Figure 11: It
can be seen from the graph that when keep probability is 0.8, it is better than 0.7, 0.9. The
accuracy of the model prediction is higher, so we choose keep probability=0.8.
15
Figure 11: The Confusion Matrix Thermogram for the Three Models
several days before. Therefore, under the same network structure, we input two different
forms of data, and train two prediction models, regarded as model S and model L, that have
a different emphasis on the stock market approach. And then we combine the prediction
results of the two models to get the final output. The first type chooses the input data as
the time series of the price data per 15 minutes of the previous day, 240 minutes in total,
same as 16 steps. The training model predicts the next day’s return y, and the second uses
the price per 120 minutes in the first 10 days. The data with 20 steps predicts the next day’s
return y.
The training results are as follows:
It can be seen from the Figure 12 that both models are valid compared to the random
selection of 25% accuracy. And the model L is better predicted than the model S. The
confusion matrix thermodynamics of the two models at the 20th epoch are in Figure 13,
which are normalized by columns.
The brighter areas of the image are concentrated near the diagonal, which again demon-
strates the validity of the model. The areas in the lower right and the upper left corner are
the brightest, which indicates that the model has a more accurate prediction of the situation
of sharply rise and fall, specifically, the precision ratio is higher. The darker area in the
middle indicates that the model does not distinguish between small rises and small falls.
Overall, we use the early stop strategy to set the checkpoint to use the two models at the
20th epoch as the final model to predict.
16
Figure 13: The Confusion Matrix Thermogram for Model S(left) and Model L(right)
17
aggregating the information in consecutive time periods. So, convolution kernels of 3×1 is
utilized, which generate new durational features containing information from 3 consecutive
time periods. Such design is inspired by the observation that most of the famous candle-stick
patterns like Three Line Strike and Three Black Crows try to find meaningful patterns in
three consecutive days. The third layer is a pooling layer that performs a 2×1 max pooling,
that is a very common setting for the pooling layers. Next, the third convolution layer is
similar to the second one, intended to further extract durational features. It is followed
by another same max pooling layer and a fully connected layer. In the very beginning, we
chose to apply similar framework in our task, using 2 convolution layers to extract high-level
features and durational features respectively, followed by a max pooling layer and finally a
fully connected layer.
However, during experiments, we find that the average pooling layer performs better
than a max pooling one. Better performance means higher accuracy rate in the test set
in this section and those below. In Figure 15, we find that the removal of pooling layer
leads to even better performance. Therefore, we assume that pooling layer is not suitable
for our task, omitting too much information. So, we remove the pooling layer. After that,
we find that the removal of the second convolution layer also leads to better performance.
Our assumption is as follows: The idea of such layer is inspired by candle-stick patterns like
Three Line Strike. It is a good idea to combine information of past consecutive 3 days to
predict daily return today. But it should be noticed here that the time span of each time
period that is combined with others should match that of the time period being predicted.
For instance, it may be effective to combine information of 3 consecutive 15-minute periods
to predict the price direction in the next 15-minute period. But such effectiveness might
not be valid when the whole daily return of today is being predicted. So, we remove the
second convolution layer. So far, there remain one convolution layer and a fully connected
layer in our semi-finished framework. After experiment, we discover that adding another
fully connected layer lead to better performance. Such change increases the depth of the
framework and thus better capture the information hidden in the data. However, we also find
that increasing the number of fully connected layers to 3 or more does not lead to obvious
improvement. So, we decide to employ 2 fully connected layers, the number of neurons being
250 and 100 respectively and ReLU function as activation function.
18
3.3.2 Comparison with DNN
We compare our convolution + 2 fully connected layers framework with a framework
with only 2 fully connected layers, and the result is in favor of ours. In Figure 16, this
demonstrates that it is effective to use one convolution layer to extract high-level features.
3.3.3 Regularization-SpatialDropout
In our CNN framework, besides commonly used dropout method in fully connected lay-
ers, our regularization methods also include SpatialDropout, a dropout method designed for
CNN. Ordinary dropout method randomly chooses some components of input data without
any fixed spatial patterns and turn them to zero, while SpatialDropout randomly chooses
some rows or columns of input data and turn them to zero. According to Figure 17, such
method is proved to be effective in image identification.
In our framework, when training our model, in each batch we employ such method to
randomly choose 3 columns in input data and turn them to zero. That means we drop 3 out
of 11 primary price-volume features and use 8 randomly left features to construct high-level
features in each batch of the training set. Through this method, the model can learn to
use different combinations of 8 primary features to construct high-level ones. It enhances
19
CNN’s ability to extract different features. In addition, the dropout methods themselves can
be comprehend as a low-cost ensemble strategy, whose essence is similar to random forest.
Specific to our task, SpatialDropout forces the framework to utilize randomly left features to
approach the best model. In this way, each primary feature is supposed to make contribution
to the final model, which prevents our model from placing extra emphasis on some primary
features in the training set, and thus, from overfitting.
After accomplishing improvements mentioned above, the accuracy on the test set is still
not satisfying. We assume that the input data of one day is not sufficient to make predictions
on the daily return today. Therefore, we consider elongating the time span on input data.
Finally, we decide to 15-minute price-volume data in 5 past consecutive days to predict the
daily return today. In Figure 18, the experiment result shows that it indeed improves the
accuracy on the test set. We think this is because such elongation enables the model to
better recognize the trends of features, which contributes to better predictions.
20
Figure 19: The Accuracy of CNN Model
21
the model not only predicts the classification of the sample but also want it can comprehen-
sively consider the benefits and risks to directly give stock purchase advice. To achieve this
purpose, we use the formula as follows:
1 X
E= p(S ∈ {Class k}) w̄(Class k)
2 model 1,2
S represents the input sample, and w̄(Class k) represents the mean of all the yields belonging
to category k in sample J, which we use to represent the expected rate of return for each
category. E represents the final prediction of the sample yield. The formula takes into
account the possible rise and fall of the sample and the predictions of the two models.
After sorting the final predicted rate of return, the higher the top-ranked stock, the more
recommended it is.
4 CONCLUSION
4.1 Result Analysis
Finally, we connect the LSTM model and the CNN model to the backtesting framework,
using the CSI300 Index as the baseline, and simulate trading from January 16, 2019 to May
31,2019. The specific trading operation is as follows: regarded I as the stock pool, we trade
according to the purchase proposal given by the model— considering the transaction fee,
we only consider buying no more than 20 stocks with an predicted profit of 0.14% or more.
The funds are allocated on average, and the portfolio are changed daily. Our results are
represented in Figure 21, 22 and Table 4.
It can be seen that both models have obvious outperform baselines regardless of the
handling fee, which explains their effectiveness from another perspective. The model slightly
outperforms the baseline when considering the handling fee. And the curves of the two
models are more intense than the baseline changes, which means that they tend to make
more aggressive decisions when trading. Therefore, it can be concluded that our model is
effective in dealing with stock return prediction with high frequency primary price-volume
data.
22
Figure 22: The Backtesting Result without Transaction Fee
23
minutes. Because of insufficient computing power, we did not use data per 1 minute or 5
minute. These higher frequency data can capture subtle information and tendency in stock
market more precisely. Therefore, in the future, we can use price-volume features per 1
minute as our dataset to obtain more primary features and improve prediction accuracy.
Besides, because in our case we trade stocks everyday, it will generate a lot of unneces-
sary transaction fees. Considering the expense of transaction fees, we want to apply reinforce
learning to maximize future profits. Moreover, it is easy to construct the environment of the
complicated and sensitive stock market without constructing features by ourselves. Hence,
reinforce learning is a feasible attempt in the future.
4.3 Conclusion
This paper applies neural network of deep learning to construct two models of LSTM
and CNN to forecast the expected return rate of the stock today, and to maximize the total
return by adapting an appropriate strategy. We analyze the performances of LSTM and
CNN and verify their effectiveness and rationalities for the application of forecasting stock
prices. Although, these two models have overcame some difficulties, there is still a possibility
of advancement such as avoiding unnecessary transaction fees. In our later works, we will
focus on solving these problems.
4.4 Acknowledgement
We would like to express sincere appreciations to Maxwell Liu from ShingingMidas
Private Fund, Xingyu Fu from Sun Yat-sen University for their generous guidance throughout
the project. Also, we are grateful to Kangkang Jiang from Sun Yat-sen University for his
assistance all the way. Without their supports, we cannot complish such a challenging task.
24
References
[1] Kai Chen, Yi Zhou, and Fangyan Dai. A lstm-based method for stock returns prediction:
A case study of china stock market. In 2015 IEEE International Conference on Big Data
(Big Data), pages 2823–2824. IEEE, 2015.
[2] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct
reinforcement learning for financial signal representation and trading. IEEE transactions
on neural networks and learning systems, 28(3):653–664, 2016.
[3] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual
prediction with lstm. 1999.
[4] Esmaeil Hadavandi, Hassan Shavandi, and Arash Ghanbari. Integration of genetic fuzzy
systems and artificial neural networks for stock price forecasting. Knowledge-Based
Systems, 23(8):800–808, 2010.
[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997.
[6] Ehsan Hoseinzade and Saman Haratizadeh. Cnnpred: Cnn-based stock market predic-
tion using a diverse set of variables. Expert Systems with Applications, 129:273–285,
2019.
[7] Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face recognition: A
convolutional neural-network approach. IEEE transactions on neural networks, 8(1):98–
113, 1997.
[8] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and
time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[9] Hailin Li, Cihan H Dagli, and David Enke. Short-term stock market timing prediction
under reinforcement learning schemes. In 2007 IEEE International Symposium on Ap-
proximate Dynamic Programming and Reinforcement Learning, pages 233–240. IEEE,
2007.
[11] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent
neural network architectures for large scale acoustic modeling. In Fifteenth annual
conference of the international speech communication association, 2014.
[12] Shunrong Shen, Haomiao Jiang, and Tongda Zhang. Stock market forecasting using
machine learning algorithms. Department of Electrical Engineering, Stanford University,
Stanford, CA, pages 1–5, 2012.
[13] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regu-
larization. arXiv preprint arXiv:1409.2329, 2014.
25