You are on page 1of 3
Exper sjtems With application 22015) 9603-9511 Contents lists available at ScienceDirect Expert Systems With Applications ar journal homepage: vow slsevier convlocateloswa Sentiment analysis on social media for stock movement prediction Deen Thien Hai Nguyen*, Kiyoaki Shirai’, Julien Velcin? sel of main Sec. Joa Advonced ste fence an Tec 11 Aa No, haw 923-1252 pa esto on (RG or 2] 5 avenue Mere Mendes Pane 59676 on Cex, ance ARTICLE INFO ABSTRACT ‘pw Sentiment anass “he gal ofthis research to bulla modelo predic stock pice movement ung the sentiment fom sci ‘rea Une previous approaches where the overall moods or sentiment ae corsidre, the sentiment Orin {a the speci topes ofthe company ae incerporated Into the stock peton model Wopk an elated Seeateten Sentiments are automaticly extracted fom the tex In 3 message board by sng ou proposed method 3 oe ‘ella existingtopicmodel nado. this paper shows an evaluation ofthe eflectveness the sentiment Sci metis analysis inthe stock prediction task va a lage scale experiment. Comparing the accuracy average ver 18 esag bad ‘Socks in one yar transaction, our method achieved 2.07 better performance than the model using historical ‘Prices only. Furthermore, when comparing the methods only forthe stocks that are dificul to pred, our ‘method achieved 983% better accuracy than historical rie method, and 3.03% better than human sentiment tethod (©2015 Elsevier L.A ights reserved 1. Introduction ‘media data. Then, these sentiments willbe integrated into a model to Stock price forecasting is very important in the planning of busi ness activity. However, building an accurate stock prediction model {sstilla challenging problem. In addition to historical prices, the eur- rent tock markets affected by the mood of society. The overall social 'mood with respect oa given company might be one of the important variables which affect the stock price ofthat company: Nowadays, the ‘emergence of online socal networks makes available large amounts ‘of mood data, Therefore, incorporating information from social me dia with the historical prices can improve the predictive ability of ‘models. The goal of our research is to develop a model to predict the stock price movement (whether the price will be up or dawn) us- ing information from social media (Message Board). In our proposed ‘method, a model that predicts the stock value at ¢ using features de~ rived from information at ¢~ 1 and ¢~ 2, where stands fora tans- action date, will be trained by supervised machine learning, Apart from the mood information, the stock prices are affected by many factors such as mieroeconomic and macroeconomic factors. However this researc only focuses on how the mood information from so- 1 media can be used to predict the stock price. We will mainly aim at extracting the mood information by sentiment analysis on social mal aldeses: nchenSx@gmalcom, bhietastacjp stirs pK Shia) en veeinbun-yon2 Vein, TH Newent upftoion 016pea.201507052 1957416j0 2015 Geir dl All gs sesered predict stocks. To achieve this goal, discovering the topics and senti- ‘ments in a large amount of social media is very important to get the opinions of investors. However, sentiment analysis on social media is dificult. The text is usualy short, contains many misspellings, un- ‘common grammar constructions and s0 on In ation, the literature shows conflicting results in sentiment analysis for stock market pre- diction, Some researchers report that sentiments from social media have no predictive capabilities (Antweiler& Frank, 2004; Tumarkin & Whitela 2001), while other researchers have reported either weak or strong predictive capabilites (Bollen, Mao, & Zeng, 2011). There fore, how to use opinions in social media for stock pre predictions is still an open problem, ‘One contribution ofthis paper is that we propose a novel feature ‘topic-sentiment'to improve the performance of stock market predic- tion. Itis important to recognize what topies are discussed in social ‘media and how people feel about these topics. The ‘topic-sentiment” feature, which represents the sentiments of the specific topics ofthe company (product, service, dividend and so on), are used for predic- tion of stock price movement. This feature is obtained in two ways: by using the existing topic model called the joint sentimentjtopic ‘model JST) and by our own proposed method. The extracted topics and sentiments in the former method are hidden (latent), whereas rot hidden in the latter. To the best of our knowledge, this is the first research trying to extract topics and sentiments simultaneously and utilize them for stock market prediction, Another contribution js a lage scale evaluation. The effectiveness of the sentiments in social meaia in stock market prediction is still uncertain because a sou "TH, Ngwen et axe Systems Wh Aplin 422015) 9602-961 relatively small data was used for evaluation in the previous work. ‘This paper investigates whether the sentiments in the social media are really useful onthe test data containing many stocks and transac- tion dates ‘The rest ofthe paper i organized a follows. Section 2 introduces some previous approaches on sentiment analysis for stock prediction, Section 3 describes our dataset. Section 4 describes our proposed ‘method. We also propose a novel feature for stock prediction based ‘on the topics and the sentiments associated with them. Section 5 as- sesses the result of the experiments. Finally, Section 6 concludes our contribution. 2. Related work Stock market prediction is one ofthe most attracted topics in aca- demic as well as rel life business. Many researches have tried to ad dress the question whether the stock market can be predicted, Some ofthe researches were based on the random walk theory and the EF- ficient Market Hypothesis (EMH). According tothe EMH (Fama, 1991; Fama, Fisher, Jensen, & Rol, 1969), the current stock market fully re- fects all available information. Hence, price changes are merely due to new information or news. Because news in nature happens ran- domly and is unknowable in the present stock prices should follow a random walk pattern and the best bet for the next price is eut- rent price. Therefore, they are not predictable with more than about 50% accuracy (Walczak, 2001)-On the other hand, various researches specify that the stock market prices do not follow a random walk, and can be predicted at some degree (Bollen etal, 2011; Qian & Rasheed, 2007; Vu, Chang, Ha, & Collier, 2012). Degrees of directional accuracy at 36% hit rate in the predictions are often reported as satisfying re- sults for stock predictions (Schumaker & Chen, 2009b; Si, Mukherjee, Liu, Li, Li, & Deng, 2013; Tsibouris &Zeidenbers, 1995). Besides the efficent market hypothesis andthe random walk the- ores, there are two distinct trading philosophies for stock market prediction; fundamental analysis and technical analysis. The funda ‘mental analysis studies the company’s financial conditions, opera tions, macroeconomic indicators to predict stock price. On the other hhand, the technical analysis depends on historical and time-series Dries. Price moves in trends, and history tends to repeat itself Some researches have tried to use only historical prices to predict the stock price (Cervellé-Royo, Guljaro, & Michniuk, 2015; Patel, Shah, ‘Thakkar, & Kotecha, 2015a, 2015b; Ticknor, 2013: Zuo & Kita, 2012a, 2012b). To discover the pattern in the data, they used Bayesian net- work (Zuo & Kita, 20122, 2012b), time-series method such as Auto Regressive model, Moving Average model (Pate tal, 2015a, 2015), Auto Regressive Moving Average model (Zuo & Kita, 20128) and While these previous methods did not consider the sentiments on the social media, in this paper our work aims at incorporating them toimprove the performance ofthe stock market prediction. Most ofthe research tried to predict only one stock (Bollen etal, 2011; Qian & Rasheed, 2007; Si et al, 2013) and the number of in- stances (transaction dates) in a test se is very low such as 14 oF 15, instances (Bollen etal, 2011; Vu etal, 2012). With only afew in- stances in the test set, the conclusion might be insuficient. To the best of our knowledge, there is no research showing a good predic- tion result on a data consisting of many stocks ina long time period. ur research ried to solve this issue by predicting 18 stocks over a period of one year 2.1. Use of opinions from text for stock market prediction Sentiment analysis has been found to play a significant role in ‘many applications such as product reviews and restaurant reviews (Liu & Zhang, 2012; Pang & Lee, 2008), There are some researches tuying to apply sentiment analysis on an information source to im- prove the stock prediction model (Nassirtouss, Azhabozorgi, Wah, & Ngo, 2014). There are two main sources from which authors have i corporated information aggregated from textual content into finan- ‘dal models. In the past, the main source was the news (Schumaker & ‘Chen, 2009a, 20096), and in recent years, social media sources. Then, these sentiments are integrated int prediction models. A simple ap- proach is combining the textual content with the historical prices through the linea regression model. ‘Most of the previous work primarily used the bag-of-words as text representation that are incorporated into the prediction model ‘Schumaker andl Chen (2009) tried to use different textual represen- tations such as bag-of-words, noun phrases and named entities for financial news. Then this information was integrated with linear r= _gession and support vector machine regression as predictive models. ‘They applied their models to estimate a discrete stock price 20 min, after a news article was released, The results show 0.04261 mean Square error, $71% directional accuracy, and 2.08% return ina sim- ulated trading engine. However, the textual representations are just the words or named entity tags, not exploiting so much about the ‘mood information “Antweiler and Frank (2004) used naive Bayes to classify the mes- sages from message boards into three classes: buy, hold and sell. The number of relevant messages in these three classes was aggregated into single measure of bulishness. They investigated three aggrega- tion functions as a numberof alternatives to bullishness, They were integrated into the regression model, However, they coneluded that their model does not successfully predict stock returns. Zhang, Fushres, and Gloor (2011) measured collective hope and fear on each day and analyzed the correlation between these indices and the stock market indicators. They used the mood words to tag. ‘each tweet as fear, worry, hope and so on. They concluded thatthe ‘emotional tweet percentage significantly negatively corelated with Dovin Jones, NASDAQ and S&P 500, but had significant positive cor- relation to VIX. However, they did not use their model to predict the stock price values Two mood tracking tools, OpinionFinder and Google Profile of Mood States, were used to analyze the text content of daily Twit- ter (Bollen et al, 2011). The former measures positive and negative ‘mood. The later measures mood in terms of six dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). They used the Self Organizing Fuzzy Neural Network model to predict DJIA values. The results show 86.7% direction accuracy (up or down), Mean Absolute Percentage EF- ror 1.79%, However, their test period is very short (Irom December 1to December 19, 2008). Even though, they achieved high accuracy there are only 15 transaction dates in their test set With such a short Period, it might not be sufficient to conclude the effectiveness oftheir method. Xie, Passonneau, Wu, and Creamer (2013) proposed a novel tree representation based on semantic frame parsers. They indicated that this representation performed signficantiy beter than bag-of- words By using stock prices from Yahoo Finance, they annotated al the news ‘with labels na transaction date as going up or Gown categories. How ‘ever, the weakness of this assumption is that all the news in one day ‘will have the same category. In addition, this becomes a document Classification problem, not stock prediction. Rechenthin, Street, and Srinivasan (2013) incorporated Yahoo nance Message Board into the stock movement prediction. They tied to use various classification models to predict stock. They used the ‘explicit sentiments and predicted sentiments obtained by a classi ‘ation model withthe bag-of-words and meta-features. ‘A keyword-based algorithm was proposed ta identify the senti- iment of tweets as positive, neutral and negative for stock prediction (Vu etal, 2012), Their model achieved around 75% accuracy. How ‘ever, thei test period is very shor, from 8th to 26th in September, 2012 which contains only 4 transaction dates. TH Nguyen eal ner Systems Wt plein 22015) 903-361 ss Siet al, (2013) developed a non-parametric topic model for Twit- ‘er messages to predict the stock market. They proposed a continuous Dirichlet Process Mixture (cDPM) model to learn the daly topic st. Then, a sentiment time series was built based on these topics, The advantage of this method is thatthe model estimates the number of| topics inherent in the data itself. However, the time period oftheir dataset is quite short, only three months ‘Aseries of the previous work discussed in ths subsection tried to extract the overall opinion or sentiment of the document. However, the opinions are often expressed toward the topics or aspects. For the prediction of stock prices, its important to know on which topics of the company people have positive or negative opinions. In our pro- ‘posed method, the sentiments ofthe topics or aspects are extracted, then they are incorporated into the stock prediction models. "Next subsection will discuss some related work forthe identfica- sion of aspect-oriented sentiments 22, Aspect based sentiment analysis ‘There are some researches trying to extract both topic and sen- ‘iment for other domains such as online product review, restau rant review and movie review dataset (Detmouche, Kouas, Velein, & Loudcher, 2015) Jo and Oh (2011) proposed ASUM model for extract- ing both aspect and sentiment for online product review dataset. The ‘made! assumes tha all words within a sentence are generated from fone topic. ‘The joint sentiment/topic motel (JST was proposed to detect sen- timent and topic simultaneously for movie review dataset (Lin & He, 2009), This model assumes that each word is generated from a Joint topic and sentiment distribution, Because this model can ex- tract topic and sentiment simultaneously, we will use it to extract {opic-sentiment features. laklaraju, Bhattacharyya, Bhattacharya, and Merugu (2011) pro- posed the FACTS, CFACTS, FACTS-R, and CFACTS-R model fo perform sentiment analysis on a product review data. These models assume word fallen into three categories: facet word, sentiment word or other category (background word, stop word, function word, et) Based on its category, a word is generated from corresponding facet, sentiment or background distribution. In addition, they introduced 4 window, which is a contiguous sequence of words. All facet words Wwithin a window are assumed to be derived from the same facet topic, and all sentiment words from the same sentiment topic. ‘Zhao, Jang, Yan, and Li (2010) proposed the MaxEnt-LDA hybrid ‘model to jointly discover both aspects and aspect-specific opinion words on a restaurant review dataset. Besides the general opinion Words, they considered the aspect-specifc opinion words. Therefore, word is fallen into five categories: background, specific aspect, gen- eral aspect, specific opinion and general opinion. Based on these cat- cegories, words are generated from corresponding distributions. The above methods tried to extract the hidden (latent) toplc- sentiment associations. In our proposed method, both hidden and ‘non-hidden topic-sentiment are considered by using JST topic model and proposed algorithms that will be discussed in Sections 45 and 4.6, respectively 3. Dataset We used two datasets for our stock prediction model. The first one Js the historical price dataset, and the second one is the mood infor- ‘mation dataset. 34, Historical prices Historical prices are extracted from Yahoo Finance for the 18 stocks. The list of stock quotes and company names is showin in the ‘Table 1. For each transaction date, there are open, high, low, close (Quotes 2 company names ‘Sucks ___Compaayraes mar ‘once te anket ane Cop xD Goyer in. De ——_—Daline EaY stayin 006 Gane ne 18 Inertial Buses Machines Corporation "0 The Coorcola Company SET Mies Comation ORC rae Cuporaton I ‘tarine and adjusted close prices. The adjusted close prices are the close prices which are adjusted for dividends and splits. The adjusted close price is often used for stock market prediction as in other researches (Rechenthin et al, 2013). Therefore, we chose it asthe stock price Value for each transaction date, 32, Message board dataset ‘To get the mood information of the stocks, we collected the 18, ‘message boards of the 18 stocks from Yahoo Finance Message Board for a period of one year (from July 23, 2012 to July 19, 2013). On the message boards, users usually discuss company news, prediction about stock going up or down, facts, comments (usualy negative) about specific company executives or company events. In 15.6% mes- sages in this dataset, when users posted messages on these message boards, they annotated each message as one ofthe following senti- ‘ment tags: Song Buy, Buy, Hold, Sell and Strong Sell. There are two kinds of messages. The fist one is the messages created by starting a new topic. The other is reply messages to existing messages. Most ‘of users’ posts are reply messages. They form a complicated commu nication network, In our research, however, we treated all messages independent from each other. Fig. shows an example message from AAPL Message Boor. In this message, on July 6, 2012 a username "Keepshorting” posted the ‘message "Looks like the competition is heating up. $199 tablet, what is next? $999 laptops and then $499 laptops? the margins are im- possible to keep up. impossible folk.” to reply to another message of another user. In adtion, this user selected the sentiment for this stock as Strong Sell” ‘The stock market is not opened at the weekend and holiday. To assign the messages to the transaction dates, the messages which ‘were posted from 4 pm of the previous transaction date to 4 pm of the current transaction date will belong to the current transaction. We choose 4 pm because that is the time of losing transaction, There are 249 transaction dates from the one year period of our dataset. ‘Table 2 summarizes the statistics of our dataset for each transaction date about the min, median, mean, max of the number of messages and the mean ofthe number ofthe existing sentiments annotated by ‘Some previous works used Twitter as the mood information source for sentiment analysis related toa particular stock. There are "Te AAPL message boar asthe highest number of messages Because of the in iain on the nurber a web pages we ony cl fora peiod teen neath

You might also like