Short-Term Stock Price Analysis Based On Order Book Information

683

原著論文 Original Paper
Short-term Stock Price Analysis Based on Order Book

Information
Kenichi Yoshida Graduate School of Business Science, University of Tsukuba, Japan
yoshida@gssm.otsuka.tsukuba.ac.jp
Akito Sakurai Graduate School of Science and Technology, Keio University, Japan
sakurai@ae.keio.ac.jp
keywords: efficient market hypothesis, high-frequency trading, time series analysis, stock market prediction, technical
analysis
Summary
Efficient market hypothesis is widely accepted in financial market studies and entails the unpredictability of
future stock prices. In this study, we show that a simple analysis can classify short-term stock price changes with
an 82.9% accuracy. Our analysis uses the order book information of high-frequency trading. The volume of high-
frequency trading, which is responsible for short-term stock price changes, is increasing dramatically; therefore,
our study suggests the importance of analyzing short-term market fluctuations, an aspect that is not well studied in
conventional market theories. The experimental results also suggest the importance of the new data representation and
analysis methods we propose, neither of which have been thoroughly investigated in conventional financial studies.
1. Introduction financial studies. However, the results were not pleasing.

This indicated that we would have to extract information
Efficient market hypothesis (EMH [Basu 77]) is widely from UDP packets by designing new data representation
accepted in financial market studies. It asserts that finan- methods. The poor prediction ability of conventional tech-
cial markets are “informationally efficient.” As a conse- niques suggests the importance of a new approach toward
quence, future stock prices change randomly. In other data representation and analysis techniques.
words, they are unpredictable [Malkiel 99]. Chapter 2 of this paper first surveys and analyzes related
In this study, we show that a simple analysis can predict work to determine the limitations of existing approaches
short-term stock prices with an 82.9% accuracy. The vol- to clarify the motivation of this research. Chapter 3 ex-
ume of high-frequency trading (HFT [Chlistalla 11]) is in- plains the proposed data representation and the analytical
creasing dramatically; therefore, an analysis of short-term method. Chapter 4 reports on the experimental results.
stock returns is increasingly important. Our results sug- Chapter 5 discusses related topics. Finally, Chapter 6 con-
gest the importance of analyzing the short-term market by cludes our findings.
showing unexpected prediction accuracy.
Our analysis uses order book information from the Tokyo 2. Related Works
stock exchange (TSE, the third largest stock exchange in
the world [Tokyo stock exchange 12]) from Jan. 2010 to EMH asserts that future stock prices are difficult to pre-
May 2014. The total amount of data analyzed was approx- dict. However, some researchers have challenged this state-
imately 6 terabytes. The volume of order book informa- ment (e.g., [Huang 94, Gallagher 02, Chordia 04, Gilbert
tion clearly reflects the increase in HFT. Although these 10, Gay Jr 11, Hsu 10, Menkhoff 10, Deng 15]). For ex-
trading records are neither small nor easy to analyze, re- ample, [Huang 94] and [Chordia 04] analyzed relation-
cent developments in high-performance computing (i.e., ship between individual stock returns and order imbal-
“big data” processing), enables an analysis of this magni- ance. [Huang 94] extracted prediction rules, and [Chor-
tude and complexity to be performed. dia 04] made regression model. However, none of them
During our study, we attempted to use common tech- reported the prediction model for HFT with comparative
niques and data representations used in most conventional accuracy reported in this paper. The recent use of social
684 人工知能学会論文誌 30 巻 5 号 F（2015 年）
network services (SNSs, e.g., Twitter) data also has been Arrownet
studied (e.g., [Bollen 11, Sprenger 13, Sul 14, Deng 14]). User
TSE Router Router
Studies based on SNS data seem to analyze the occurrence System
of information asymmetry. These studies analyze the in- UDP packet

formation propagation process through SNSs and found ID Stock ID Ask/Bid Info. Ask/Bid Info. ...
situations in which information asymmetry exists. Infor-
mation symmetry is an important assumption of EMH;
therefore, a prediction ability in such a situation is not sur- Time Ask Size Ask Price Bid Size Bid Price
prising.
Fig. 1 Information in Arrownet UDP packets
In our study, we aim to find situations in which infor-

Table 1 A sample order book
mation asymmetry exists in HFT. The order book is the
source of information from which we try to determine the
asymmetry. The order book contains information about Ask Size Price Bid Size
ask and bid orders for each stock at a certain price. Each ... ...
order represents the price that each trader recognizes as 800 2026
being adequate. In HFT, each trader experiences a dif- 600 2015
ferent time lag when processing information. Even with 300 2000
high-speed computers and with a high-speed network con- 1965 100
nection, such asynchrony is inevitable. With an adequate 1955 200
data representation, we try to analyze the effects of such 1950 1800
asynchrony. ... ...
As the amount of HFT is increasing, there has been a

3. Order Book and its Representation
proliferation of primers and introductory handbooks on
3·1 Order book in Arrownet
this topic (such as [Lewis 14] and [Patterson 13]). How-
The TSE has been distributing real-time stock exchange
ever, there still exists room for a comprehensive study of
information through a high-speed trading network named
the characteristics of HFT. For example, to the best our
“Arrownet” since Jan. 2010 [Tokyo stock exchange 12].
knowledge, there is no simple report on the predictabil-
When the TSE processes an order, it sends a UDP packet
ity of stock returns through HFT. EMH might discourage
that contains information about the order, i.e., the stock
such studies. Current studies analyze different aspects,
ID, size of order, and ask/bid price. In particular, each
such as liquidity that formed the subject of [Beltran-Lopez
UDP packet contains the eight lowest ask orders and eight
12] and [Danielsson 12].
highest bid orders (See Figure 1). The packet can therefore
be described as a list of orders that records the interest of
In conventional studies of the financial market, meth- buyers and sellers on a particular stock, and is essentially
ods from statistics and signal processing, such as the auto- an order book for the stock (see Table 1 for example). A
regressive (AR) model [Champernowne 48], the auto-regressive comparison of the current packet with its predecessor for
moving average (ARMA) model [Box 13], and the auto- the same stock enables the extraction of the information of
regressive integrated moving average (ARIMA) model [Box the corresponding order.
13] are used. These methods model time series data with Figure 2 compares the data representations of orders.
a simple combination of elementary functions. For exam- As explained before, each UDP packet corresponds to a
ple, the AR model represents a sequence of events over single order. Figure 2 (1) shows the actual order events
time as yt = φyt−1 + wt . This equation relates the value in real world trading. In the real world, each order event
assumed by y at time t to another variable wt and to the occurs at a random interval. On the contrary, conven-
value y at the previous time. Here, t represents a moment tional methods use fixed intervals, as shown in Figure 2
in time based on a fixed time interval. Our experimental (2), which shows a typical time series representation of
results show the inappropriateness of these conventional the same order events. Because a fixed interval is used to
approaches, and suggest the importance of the proposed represent data, the timing information of events is lost in
new data representation and analysis methods. this representation.
Short-term Stock Price Analysis Based on Order Book Information 685
(1) Arrownet (Real World Trade Status) 0.8

AVE vs. AM
B1 B1 AVE vs. PM
A1 B2 0.6
A4 B3
A2 Quoted price
A3 A3
0.4
Correlation Coefficient
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
Time
0.2
(2) Traditional Time Series Representaion
Q1 Q1 Q1 Q1 Q1 Q1 Q1 Quoted price 0
Q2 Q2 Q2
-0.2
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
Time
(3) Event-based Representaion 1 -0.4
E3 E9
E1 E5 -0.6
0 100 200 300 400 500 600 700
E6 E7
E2 Quoted price Sorted Order of Correlation Coefficients
E4 E8
Fig. 3 Correlation coefficients between AV Eam and returns

(4) Event-based Representaion 2
e1
e2
Quoted price
ries data has a poor prediction capability, whereas the pro-
Order Book Information
Time Quoted Price Number of Asks Number of Bids ...
posed representation has superior prediction abilities. We
will report the results of the preliminary experiments be-
Fig. 2 Representation of order events low.
To represent information stored in the event-based rep-
resentation, we first designed the following index AV Eam .
Although the conventional time series representation en-
16
ables mathematical analyses of quoted price changes, its (Price Leveli (t)−Quoted Price(t))∗(Size of Price Leveli (t)))
AV E t = i=1 16
(Size of Price Leveli (t))
use of a fixed time interval disregards the timing infor- i=1
mation of events. In addition, in most cases, it does not throughout morning

AV E am = t at which an order is processed AV Et
distinguish between ask and bid orders, but only repre-
sents the quoted price of the stock. In the example shown AV Eam simply sums up the weighted difference between
in Figure 2 (1), the interval between events A2 and B1 is ask/bid levels and the most recent quoted price of the stock
short, but the interval between events B1 and A3 is much at the time when an order is processed. We tried to ex-
longer. Such information is lost in the example shown in tract the interest of buyer and seller on the quoted price by
Figure 2 (2). Furthermore, ask and bid orders are also not summing up AV Et stored in each UDP packet, i.e., each
distinguished in the example in Figure 2 (2). order. Thereby, the price of the eight lowest ask orders
To represent the information that is disregarded in con- (i.e., ask price levels) and the eight highest bid orders (16
ventional time series representations, we have designed orders in total) are extracted from each UDP packet.
two event-based representations (Figure 2 (3) and (4)).
Figure 3 shows the Spearman’s rank correlation coeffi-
Event-based representation 1 (Figure 2 (3)) represents the
cients between AV Eam and the corresponding stock re-
order book information at the moment when each order
turns. The coefficients were calculated by first selecting
event is processed. In particular, it represents the stock
the 100 most frequently traded stocks during the period
ID of the event, the most recently quoted price: ask/bid
from 2010 to 2012 (3 years or 680 days). Then, we calcu-
size/price. Other information included in our design to
lated the difference (dam ) between the first quoted price of
improve the analysis is also stored (See Section 4·1 for
the morning period and the last quoted price of the morn-
details). Event-based representation 2 (Figure 2 (4)) repre-
ing period for each stock. Finally, we calculated Spear-
sents the same information as event-based representation
man’s rank correlation coefficient between AV Eam and
1, but only the one at the time at which a trade of the stock
dam each day. “AVE vs. AM” in the figure shows the re-
takes place.
sults. Here, the y-axis shows Spearman’s rank correlation
coefficients. The coefficients are sorted according to the
3·2 Preliminary experiments values. As shown in the figure, the data is weakly corre-
We conducted several preliminary experiments which lated.
suggested that the conventional representation of time se- However, this weak correlation disappears when we cal-
culate the coefficient between AV Eam and the stock re- 0.0001
With AVE
turns for the afternoon period (i.e., the difference between Without AVE
the last quoted price of the morning period and the last
Squared Correlation Coefficient

0.001
quoted price of the day). The plot “AVE vs. PM” in Fig-
ure 3 shows the coefficients.
Although the non-existence of a correlation between 0.01
AV Eam and the afternoon returns suggests the impossi-

bility to predict returns even in half a day, the existence of
0.1
a weak correlation between AV Eam and the morning re-
turns suggests the possibility that short-term returns may
be predicted. If AV Eam did not have any relation with 1
0 100 200 300 400 500 600 700
stock returns, the difference between “AV Eam vs. AM”
Sorted Order of Squard Correlation Coefficients
and “AV Eam vs. PM” would lack reason for its occur-
rence. The observed difference seems to suggest the exis- Fig. 4 SVR performance
tence of a short-term correlation.
The next step in our study was to determine whether the 4. Experimental Results using Event-based
weak correlations are utilized by the conventional repre- Representation
sentation, i.e, fixed interval representation of quoted prices.
We accomplished this by building the following linear re- 4·1 Event-based representation and Attributes
gression model learned by support vector regression (SVR): This chapter reports the results of experiments that use
the TSE order book information from Jan. 2010 to May
2014. We analyzed the UDP packet information that was
Quoted Pricet+10 = wT Φ(xt ) + b + t exchanged through the high-speed trading network named
“Arrownet” [Tokyo stock exchange 12]. The total amount
of data that was analyzed was approximately 6 terabytes.
Here xt is the vector of Quoted Pricei and AV Ei (i = Event-based representations (Figure 2 (3) and (4)) were
t, t − 1, . . . , t − 30). The interval is one minute. Figure 4 used in the analysis.
shows the squared correlation coefficients of the true and The analysis used libSVM [Chang 11] and J48 (an im-
regressed Quoted Pricet , which are sorted in increasing plementation of C4.5 in weka [Hall 09]). The default pa-
order. “With AVE” corresponds to the regression with rameters were used in the experiments to solve the three
AV Ei s in explanatory variables and “Without AVE” cor- class problem (“up,” “down,” and “unchanged”). The at-
responds to the regression without AV Ei s. tributes that were used are:
Average price in orders: e1
Although the weak correlations found in Figure 3 sug-
As explained before, AV Et is the difference between
gest the existence of a short-term correlation, Figure 4
the most recent quoted price and the average price of
shows that the SVR model failed to reproduce it.
most recent orders. e1 is the sum of AV Et between
We summarize the results of the preliminary experiments the time of the most recent trade and the time of the
as follows. When the SVR model was trained to predict current order.
future quoted prices using the time series data from the Number of asks and bids: e2 , e3 , e4
previous 30 min by using the representation of conven- Number of UDP packets for “ask (e2 )” and “bid (e3 )”
tional time series analysis, the SVR model was unable to after the most recent trade. Here, e4 = e2 − e3 .
confirm the existence of a weak short-term relationship, Number of lower asks and higher bids: e5 , e6 , e7
even though the correlation analysis shown in Figure 3 Number of UDP packets that ask with price lower
suggests that such a relationship exists. This led us to con- than the most recent quoted price (e5 ), and number
clude that the conventional time series data representation of UDP packets that bid with price higher than the
is responsible for causing the SVR to perform poorly by most recent quoted price (e6 ). e7 = e5 − e6 .
removing important information from the data. The event- Number of lower bids and higher asks: e8 , e9 , e10
based representation shown in Figure 2 was inspired by Number of UDP packets that ask with price higher
this observation. than the most recent quoted price (e8 ), and number of
UDP packets that bid with price lower than the most
recent quoted price (e9 ). e10 = e8 − e9 .
ZeroR 49.8%
Number of asks and bids with quoted price: e11 , e12 , e13
Number of UDP packets for ask and bid with the e_14 58.7%
same price as the most recent quoted price after the

corresponding trade (e11 , e12 ). e13 = e11 − e12 . SVM_R1 60.7%
Differences between the recent quoted price J48_R1 61.9%

and its predecessor: e14
This may either be the latest difference or a series of SVM_R2 74.5%
the values up to the four most recent differences.

J48_R2 79.2%
As explained above, e1 is designed to capture the average
of peoples’ perception of the stock price. The variables
0 20 40 60 80 100
e2 , e3 , ..., e13 are designed to count the number of people Accuracy (%)
who wish to ask and bid and e14 is designed to express the
trend of the stock returns. Fig. 5 Classification accuracy based on the order book
4·2 Classification accuracy based on order book infor-

mation classification accuracy of SVM and J48 based on event-
based representation 2 with variables e1 , e2 , ..., e13 . The
Figure 5 shows the average classification accuracy of
result of “J48 R2” (i.e., 79.3%) clearly shows the classi-
the direction of stock price changes based on the order
fication ability of the order book information. Note that
book information. These results are based on data from
“SVM R2” and “J48 R2” only use the most recent value
the TSE from Jan. 2010 to May 2014. The 100 most fre-
for e1 , e2 , ..., e13 and do not depend on its past history e14 .
quently traded stocks are used in the experiments. The
order book history for each stock is recorded at the time The high classification accuracy of 79%, shown in Fig-
when an order is processed. The values of quoted price, ure 5, requires careful explanation, because this result does
ask/bid size and price are extracted from the data to make not suggest that there is a prediction method capable of
the order history of each stock. Then subsequences of or- achieving a 79% prediction accuracy. The classification
der history, whose length are 10,000, are extracted to make was performed based on information extracted from the
training sets. They are represented with event-based rep- latest UDP packet and possibly previous UDP packets.
resentations 1 and 2. Using each training set, the SVM The output of the learned classifier is the direction of pos-
and J48 models are trained to learn a classification model, sible quoted price changes by the order, i.e., the direction
which solves a three class (i.e., price will go “up,” “down,” of the quoted price changes if the order results in a trade,
or remain “unchanged”) problem. For each training set, or, otherwise, the direction of the quoted price changes in
the subsequent 1000 orders are extracted to construct the future. In other words, the change has already been real-
corresponding test set. Figure 5 shows the average predic- ized once a trade has taken place; therefore, the output is
tion accuracy. just a classification but not a prediction.
ZeroR classifies all the data according to the mode of the Although we will show the true prediction accuracy which
target variable in the training data. Because a few samples can be used in the actual trading shortly, the important
are “unchanged” and about half of the samples are either finding in Figure 5 is “Short-term stock return does not
“up” or “down”, the classification accuracy of ZeroR is walk randomly. The order book can classify stock return
approximately 50% as shown in Figure 5. movement.” We can use Arrownet to classify stock re-
e 14 in Figure 5 shows the average classification accu- turn movements in HFT. Thus, the famous assertion “in-
racy (58.7%) by J48 which uses only e14 as the explana- formational efficiency of financial markets” does not exist
tory variable. We conducted the experiments by varying in HFT. The time lag existing in computers and networks,
the length of e14 to determine whether the classification which is inescapable in HFT, may be a cause of such inef-
accuracy would be affected by the length of the quoted ficiency.
price history. However, we found the prediction accuracy “SVM R1” and “J48 R1” show the average prediction
did not differ much. Thus, we only used the most recent accuracy based on e1 , e2 , , ..., e13 . These results are based
e14 values as the data for the quoted price history. on event-based representation 1. Under the representation,
“SVM R2” and “J48 R2” in Figure 5 show the average the expected output, or the target output of each predic-
tion, is the direction of change of the next quoted price.

Therefore, this setting is close to actual trading. Though Without1 61.9%
the prediction accuracy is only relatively high (60.7% by
Without2 79.2%
SVM and 61.9% by J48), it is higher than the result based
on the history of quoted prices (i.e., 58.7% of e 14 in Fig- With1 62.9%
ure 5). With2 81.1%

Although the results of the preliminary experimentation
shown in Figure 3 only suggest the prediction ability of or- 0 20 40 60
Accuracy (%)
80 100
der book information, the results in Figure 5 clearly con-

firms this. We believe the latter is mainly caused by the
Fig. 6 Improvement of accuracy by using historical data
data representation we used. The results shown in Figure 4
exhibit inability of prediction and are based on the tradi-
tional time series representation while the results shown in
Figure 5 suggest the prediction capability of our approach Rep.1 62.9%
and are based on event-based representation. Training 66.5%
Note that the results obtained with J48 are a slight im- Rep.2 81.1%
provement over those obtained with SVM. SVM models

0 20 40 60 80 100
data by using a smooth function of explanatory variables. Accuracy (%)
Although the training method for parameters of the func-

tion is different, traditional methods of time series analysis Fig. 7 Representation 2 is used for training while Representation 1 is
such as ARIMA also models data by applying a smooth used for test
function. However, J48 achieved better results using a to-
tally different framework, i.e., a classification tree. These on the test set from event-based representation 1. With
results indicate the possible inappropriateness of traditional this setting, the target label is the quoted price change di-
time series analyzing methods (i.e., the use of fixed inter- rection that is realized by the training data. This means
val time series representation and smooth functions) for that the output is just a classification and not a prediction.
the analysis of HFT. See Chapter 5 for further discussion. However, when the model is applied to the test data with
event-based representation 1, because the input, i.e., an
4·3 Improvement of accuracy order, will not necessarily be a fulfilled order, the output
Because the prediction accuracy is important for actual label could be interpreted as a prediction of the direction
stock trading, we have attempted to improve this accuracy of the next quoted price change. Based on this interpreta-
and the results are reported in this chapter. The promising tion, we could use the model in real trading and calculate
results we achieved by using event-based representation the expected prediction accuracy for the situation.
and J48, prompted us to use these two methods as the base The prediction accuracy of a model trained with event-
framework, to aim to improve the prediction accuracy with based representation 2 and tested with event-based repre-
additional means. sentation 1 is 66.5% as shown in Figure 7 next to the label
Figure 6 shows the classification/prediction accuracy of “Training.” This value is in between the values of event-
J48 using both historical data of the quoted price e14 and based representations 1 (Rep.1 in the figure) and 2 (Rep.2
order book information e1 , e2 , ..., e13 . The accuracy that in the figure, Figure 7 Rep.1: 62.9% and Rep.2: 81.1%
was achieved when event-based representations 1 and 2 respectively∗1 ). Although we have not yet identified the
(“With1” and “With2” in the figure) were used with e14 . cause of this improvement (i.e., the difference in accuracy
On the other hand, “Without1” and “Without2” show the between “Rep.1” and “Training”), this phenomenon is in-
accuracy without the use of e14 . From this, we concluded teresting, and requires further study.
that the use of historical data together with order book in- The last experiment we conducted as part of this study
formation leads to a slight improvement of accuracy com- was to examine the number of data sets that would be re-
pared to when the order book information is excluded. quired for training. During the experiments, we noticed
The improved accuracy we achieved with event-based fluctuations in the prediction accuracy. Figure 8 shows the
representation 2, led us to perform additional experiments distribution of the prediction accuracy for a single stock.
to determine the prediction accuracy of J48 model, which ∗1 Figure 7 Rep.1 shows the same value that was shown in Figure 6
is trained with event-based representation 2, but is tested (With1) for comparison purposes.
100 100
90 90
Classification Accuracy (%)
Classification Accuracy (%)

80 80
70 70
60 60
50 50
40 40
0 1000 2000 3000 4000 5000 6000 7000 8000 0 200 400 600 800 1000 1200
Traning Data ID Traning Data ID
Fig. 8 Distribution of prediction accuracy using 10,000 orders for Fig. 10 Distribution of classification accuracy using five day training
training and 1,000 orders for testing data
R1.10k 66.5% All 67.7%
R2.10k 81.1% 60.2%

e_1,e_14
R1.Day 67.7%
e_4,e_14 62.2%
R2.Day 82.9%
e_7,e_14 61.7%
0 20 40 60 80 100
Accuracy (%)
e_10,e_14 59.9%
e_13,e_14 67.7%
Fig. 9 Effect of the size of the training data set
0 20 40 60 80 100
Accuracy (%)
The number of orders for this stock between Jan. 2010
and May 2014 was approximately 77,000,000, and we ex-
Fig. 11 Prediction accuracy with different attributes
tracted 7,700 training sets, each of which consisted of 10,000
orders. Using the subsequent 1,000 orders of this stock,
we calculated the prediction accuracy. Figure 8 plots each 5. Discussion
accuracy and the results vary greatly as can be seen in the
figure. As we suspected this variation to be caused by the Figure 11 compares the prediction accuracy of the mod-
use of a training set of insufficient size, we performed the els trained with one of the attributes e1 , e4 , e7 , e10 , e13 and
same experiment again after increasing its size. e14 . These results compare the explanatory power of order
Figure 9 shows the results of experiments with a larger book information. As shown in Figure 11, the most infor-
size training set. In this experiment, five days of order his- mative attribute is e13 other than e14 . Figure 12 shows an
tories were used as the training data set, and orders placed example of the learned rule during the experiments with
on the next single day were used for the test data set ∗2 . As attributes e13 and e14 . As shown in Figure 12, e13 (i.e.,
shown in Figure 9, the classification accuracy with event- the number of asks and bids with the most recent quoted
based representation 1 increased from 81.1% to 82.9%, price) is sufficient to classify the future stock price change
whereas the prediction accuracy with event-based repre- direction with accuracy 82.9%.
sentation 2 increased from 66.5% to 67.7% and the distri- It is difficult to understand why this simple rule was not
bution of accuracy becomes narrower (See Figure 10). reported before. However, as far as the short-term pre-
diction of HFT is concerned, it is clear that stock returns
∗2 The size of the training/test data sets differed in terms of the do not show a random walk. The institutional investors
stocks and date of trades. The average size of the training/test data
sets for the data shown in Figure 9 were approximately 60,000 and
who have direct access to the trading system may have
6,000 respectively. an unfair advantage over individual investors who do not
J48 pruned tree

------------------
e14 <= -1
| e13 <= -1: n (111.0/24.0)
| e13 > -1: p (878.0/166.0)
e14 > -1
| e13 <= 0: n (845.0/135.0)
| e13 > 0: p (125.0/26.0)
Number of Leaves : 4 68
Size of the tree : 7
Change in Prediction Accuracy (%)

Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.11 seconds 67.8
=== Error on training data ===
Correctly Classified Instances 1608 82.0827 %
Incorrectly Classified Instances 351 17.9173 % 67.6
Kappa statistic 0.6417
Mean absolute error 0.1956
Root mean squared error 0.3128
Relative absolute error 58.6688 % 67.4
Root relative squared error 76.615 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 66.6667 %
Total Number of Instances 1959 67.2
=== Error on test data ===
Correctly Classified Instances 942 81.9843 %
Incorrectly Classified Instances 207 18.0157 % 67
Kappa statistic 0.6393 0 5 10 15 20
Mean absolute error 0.1978 Days
Root mean squared error 0.3145
Relative absolute error 59.3132 %
Root relative squared error 77.014 % Fig. 13 Stability of the learned rule
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 66.6667 %
Total Number of Instances 1149
Fig. 12 An example of learned rules
have such high speed access. Although we did not investi-

gate this phenomenon in other HFT markets, such as retail
foreign exchange trading, the simplicity of the found rule
suggests the generality of our finding. However, this issue
would have to be investigated in future.
Figure 13 shows another possibility of unfairness. We
conducted an experiment on the prediction accuracy of
learned rules. The prediction accuracy of the learned rule
decreases day by day. However, as Figure 13 shows, the
speed of degradation is slow. This stability of the learned
rule facilitates implementation of the system, because in-
vestors who wish to use the trading system with the found
rule can replace the rules every weekend, as J48 requires (1)
less than an hour of learning time. Investors can update
the system during weekends.
The last topics we have to discuss in this study are the
comparison of J48 and SVM. We have conducted various
experiments in which J48 tends to outperform SVM. How-
ever, the simplicity of the hidden rule behind the stock
price changes, as shown in Figure 12, does not explain the
inability of SVM to perform better than J48.
(2) (3)
Figure 14 shows examples of the decision boundary found
by J48 and SVM. As shown in Figure 14, the boundaries Fig. 14 Boundary of J48 and SVM
of SVM and that of J48 appear similar. The capability
of using tilted boundaries (See Figure 14 (2)) or smooth
nonlinear boundaries (Figure 14 (3))) would suggest that
SVM has an advantage. However, the results suggest that
the opposite is true.
Our assumption for the reason for this phenomenon is ♦ References ♦

the difference between the penalty functions that J48 and
[Basu 77] Basu, S.: Investment performance of common stocks in re-
SVM use for learning. The soft-margin formulation of lation to their price-earnings ratios: A test of the efficient market hy-
SVM uses the distance |t − (wT φ(x) + b)| as the penalty pothesis, The Journal of Finance, Vol. 32, No. 3, pp. 663–682 (1977)
[Beltran-Lopez 12] Beltran-Lopez, H., Grammig, J., and
for each miss-classified data. The penalty used by J48
Menkveld, A. J.: Limit order books and trade informativeness,
is simpler, i.e., entropy or information gain. The eval- The European Journal of Finance, Vol. 18, No. 9, pp. 737–759
uation measure for our experiments, i.e., the classifica- (2012)
[Bollen 11] Bollen, J., Mao, H., and Zeng, X.: Twitter mood predicts
tion/prediction accuracy is the ratio of the number of sam- the stock market, Journal of Computational Science, Vol. 2, No. 1,
ples that are correctly classified/predicted versus those that pp. 1–8 (2011)
[Box 13] Box, G. E., Jenkins, G. M., and Reinsel, G. C.: Time Series
are not. The ratio is not dependent on how far from the Analysis: Forecasting and Control, John Wiley & Sons (2013)
boundary the miss-classified samples reside. Therefore, [Champernowne 48] Champernowne, D.: Sampling theory applied to
autoregressive sequences, Journal of the Royal Statistical Society. Se-
the penalty function used by J48 seems to be more ade-
ries B (Methodological), Vol. 10, No. 2, pp. 204–242 (1948)
quate than that of SVM to express the errors for the pre- [Chang 11] Chang, C.-C. and Lin, C.-J.: LIBSVM: A library for sup-
diction in the change of the stock price. Because a similar port vector machines, ACM Transactions on Intelligent Systems and
Technology, Vol. 2, pp. 27:1–27:27 (2011), Software available at
distance, such as (t − (wT φ(x) + b))2 , is used in most AR http://www.csie.ntu.edu.tw/ cjlin/libsvm
models and their descendants, they may suffer from a sim- [Chlistalla 11] Chlistalla, M., Speyer, B., Kaiser, S., and Mayer, T.:
High-frequency trading, Deutsche Bank Research, pp. 1–19 (2011)
ilar problem. [Chordia 04] Chordia, T. and Subrahmanyam, A.: Order imbalance
and individual stock returns: Theory and evidence, Journal of Finan-
cial Economics, Vol. 72, No. 3, pp. 485–518 (2004)
6. Conclusion [Danielsson 12] Danielsson, J. and Payne, R.: Liquidity determina-
tion in an order-driven market, The European Journal of Finance,
In this study, we analyzed order book information of Vol. 18, No. 9, pp. 799–821 (2012)
[Deng 14] Deng, S., Mitsubuchi, T., and Sakurai, A.:
the TSE (Tokyo Stock Exchange, the third largest stock Stock Price Change Rate Prediction by Utilizing So-
exchange in the world) from Jan. 2010 to May 2014, cial Network Activities., The Scientific World Journal
(2014), Article ID 861641, doi:10.1155/2014/861641 at
and showed that a simple rule is capable of classifying http://www.hindawi.com/journals/tswj/2014/861641/
the change in direction of short-term stock prices with an [Deng 15] Deng, S., Yoshiyama, K., Mitsubuchi, T., and Sakurai, A.:
Hybrid method of multiple kernel learning and genetic algorithm for
82.9% accuracy and of predicting it with an accuracy of
forecasting short-term foreign exchange rates, Computational Eco-
67.7% by using the order book information of HFT (High- nomics, Vol. 45, No. 1, pp. 49–89 (2015)
Frequency Trading). [Gallagher 02] Gallagher, L. A. and Taylor, M. P.: Permanent and
temporary components of stock prices: Evidence from assessing
As the volume of HFT, which forms short-term stock macroeconomic shocks, Southern Economic Journal, pp. 345–362
exchanges, is increasing dramatically, our study suggests (2002)
[Gay Jr 11] Gay Jr, R. D., et al.: Effect of macroeconomic variables
the importance of short-time market analyses that are not on stock market returns for four emerging economies: Brazil, Rus-
well studied in conventional financial market theories. sia, India, and China, International Business & Economics Research
Journal (IBER), Vol. 7, No. 3 (2011)
The experimental results also suggest the importance of [Gilbert 10] Gilbert, E. and Karahalios, K.: Widespread Worry and
new data representation, i.e., event-based representation, the Stock Market, in ICWSM, pp. 59–65 (2010)
and the analysis methods that are not well studied in con- [Hall 09] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reute-
mann, P., and Witten, I. H.: The WEKA Data Mining Software: An
ventional financial studies. The use of the penalty func- Update, SIGKDD Explorations, Vol. 11, pp. 10–18 (2009), [accessed
tion to determine optimal model parameters is important 19-Aug-2014]
[Hsu 10] Hsu, P.-H., Hsu, Y.-C., and Kuan, C.-M.: Testing the pre-
to identify the model with optimal performance. dictive ability of technical analysis using a new stepwise test without
Although the prediction accuracy is less than the clas- data snooping bias, Journal of Empirical Finance, Vol. 17, No. 3, pp.
471–484 (2010)
sification accuracy, it is still 67.7% and can be applied to [Huang 94] Huang, R. D. and Stoll, H. R.: Market microstructure and
actual trading. This level of accuracy may suggest that in- stock return predictions, Review of Financial Studies, Vol. 7, No. 1,
stitutional investors who can access the trading system di- pp. 179–213 (1994)
[Lewis 14] Lewis, M.: Flash Boys, W.W.Norton & Company, Inc.
rectly have an unfair advantage over individual investors (2014), ISBN 978-0-393-24467-0
who do not have such high-speed access. [Malkiel 99] Malkiel, B. G.: A Random Walk Down Wall Street: In-
cluding a Life-cycle Guide to Personal Investing, W.W. Norton &
Company (1999)
Acknowledgments [Menkhoff 10] Menkhoff, L.: The use of technical analysis by fund
managers: International evidence, Journal of Banking & Finance,
This work was supported in part by JSPS KAKENHI Vol. 34, No. 11, pp. 2573–2586 (2010)
Grant Number 25280114 and 25330266. [Patterson 13] Patterson, S.: Dark Pools: The Rise of the Machine
Traders and the Rigging of the U.S. Stock Market, Crown Business
(2013), ISBN 978-0-307-88719-1
[Sprenger 13] Sprenger, T. O., Tumasjan, A., Sandner, P. G., and
Welpe, I. M.: Tweets and trades: The information content of stock

microblogs, European Financial Management, Wiley (2013)
[Sul 14] Sul, H., Dennis, A. R., and Yuan, L. I.: Trading on Twit-
ter: The Financial Information Content of Emotion in Social Media,
in 2014 47th Hawaii International Conference on System Sciences
(HICSS), pp. 806–815 IEEE (2014)
[Tokyo stock exchange 12] Tokyo stock exchange, : arrownet (2012),
[accessed 19-Aug-2014]
〔担当委員：和泉潔〕
Received July 27, 2015.
Author’s Profile
Kenichi Yoshida (Member)

received the Ph.D. in engineering from Osaka University in
1992. In 1980, he joined Hitachi Ltd. Since 2002, he has
been a Professor at University of Tsukuba, Tokyo, Japan.
His current research interest includes application of Internet
and application of machine learning techniques.
Akito Sakurai (Member)

graduated from the University of Tokyo in 1975, completed
his master’s course in 1977, and joined Hitachi, Ltd. In 1978
he earned an M.S. degree in Computer Science at the Uni-
versity of Illinois. In 1989 he moved the Advanced Research
Laboratory, Hitachi, and in 1996 the Central Research Labo-
ratory. In 1998 he became a professor at the Japan Advanced
Institute of Science and Technology. He has been a profes-
sor at Keio University since 2002. He holds D.Eng. degree
(The University of Tokyo, 1993).

Short-Term Stock Price Analysis Based On Order Book Information

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Short-Term Stock Price Analysis Based On Order Book Information

Uploaded by

Copyright:

Available Formats

683

Short-term Stock Price Analysis Based on Order Book

1. Introduction financial studies. However, the results were not pleasing.

of information asymmetry. These studies analyze the in- UDP packet

In our study, we aim to find situations in which infor-

As the amount of HFT is increasing, there has been a

(1) Arrownet (Real World Trade Status) 0.8

Fig. 3 Correlation coefficients between AV Eam and returns

mation of events. In addition, in most cases, it does not throughout morning

Squared Correlation Coefficient

Although the non-existence of a correlation between 0.01

AV Eam and the afternoon returns suggests the impossi-

same price as the most recent quoted price after the

Differences between the recent quoted price J48_R1 61.9%

the values up to the four most recent differences.

4·2 Classification accuracy based on order book infor-

tion, is the direction of change of the next quoted price.

ure 5). With2 81.1%

der book information, the results in Figure 5 clearly con-

and are based on event-based representation. Training 66.5%

provement over those obtained with SVM. SVM models

Although the training method for parameters of the func-

Classification Accuracy (%)

R1.10k 66.5% All 67.7%

R2.10k 81.1% 60.2%

J48 pruned tree

Change in Prediction Accuracy (%)

Fig. 12 An example of learned rules

have such high speed access. Although we did not investi-

Our assumption for the reason for this phenomenon is ♦ References ♦

Welpe, I. M.: Tweets and trades: The information content of stock

Received July 27, 2015.

Kenichi Yoshida (Member)

Akito Sakurai (Member)

You might also like